===================================
Tutorial: Extractive Summarization
===================================

This tutorial will go through the implementation of several extractive
summarization models with spaCy. An *extractive* summarization system is a
filter over the original document/s: most of the text is removed, and the
remaining text is formatted as a summary. In contrast, an *abstractive*
summarization system generates new text.

Application Context
-------------------

Extractive summarization systems need an application context. We can't ask how
to design the system without some concept of what sort of summary will be
useful for a given application. (Contrast with speech recognition, where the
notion of "correct" is much less application-sensitive.)

For this, I've adopted the application context that `Flipboard`_ discuss in a
recent blog post: they want to display lead-text to readers on mobile devices,
so that readers can easily choose interesting links.

I've chosen this application context for two reasons. First, `Flipboard`_ say
they're putting something like this into production. Second, there's a ready
source of evaluation data: we can look at the lead-text that human editors
have chosen, and evaluate whether our automatic system chooses similar text.

Experimental Setup
------------------

Instead of scraping data, I'm using articles from the New York Times Annotated
Corpus, a handy dump of XML-annotated articles distributed by the LDC. The
annotations come with a field named "online lead paragraph". Our summarization
systems will be evaluated on their Rouge-1 overlap with this field. Further
details of the experimental setup can be found in the appendices.

.. _newyorktimes.com: http://newyorktimes.com

.. _Flipboard: http://engineering.flipboard.com/2014/10/summarization/

.. _vector-space model: https://en.wikipedia.org/wiki/Vector_space_model

.. _LexRank algorithm: https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html

.. _PageRank: https://en.wikipedia.org/wiki/PageRank

Summarizer API
--------------

Each summarization model will have the following API:

.. py:function:: summarize(nlp: spacy.en.English, headline: unicode, paragraphs: List[unicode], target_length: int) -> unicode

We receive the headline, a list of paragraphs, and a target length. We have to
produce a block of text where ``len(text) < target_length``. We want summaries
that users will click on, and not bounce back out of. Long-term, we want
summaries that would keep people using the app.

Baselines: Truncate
-------------------

.. code:: python

    def truncate_chars(nlp, headline, paragraphs, target_length):
        text = ' '.join(paragraphs)
        return text[:target_length - 3] + '...'


    def truncate_words(nlp, headline, paragraphs, target_length):
        text = ' '.join(paragraphs)
        tokens = text.split()
        n_words = 0
        n_chars = 0
        # Add words while the next one still fits within the budget, leaving
        # three characters for the ellipsis.
        while (n_words < len(tokens)
               and n_chars + len(tokens[n_words]) + 1 <= target_length - 3):
            n_chars += len(tokens[n_words]) + 1  # Word plus a following space
            n_words += 1
        return ' '.join(tokens[:n_words]) + '...'


    def truncate_sentences(nlp, headline, paragraphs, target_length):
        summary = ''
        for para in paragraphs:
            tokens = nlp(para)
            for sentence in tokens.sentences():
                sent_text = str(sentence)
                if len(summary) + len(sent_text) >= target_length:
                    return summary
                summary += sent_text
        return summary

I'd be surprised if Flipboard never had something like this in production.
Details like lead-text take a while to float up the priority list.
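To make the API concrete, here is how one of these baselines might be called.
The headline, the paragraphs, and the 160-character budget below are invented
for illustration, and ``truncate_sentences`` from the listing above is assumed
to be in scope:

.. code:: python

    from spacy.en import English

    nlp = English()
    headline = u'An invented headline'
    paragraphs = [
        u'This is the first paragraph of an invented article.',
        u'This is the second paragraph. It adds a little more detail.',
    ]

    # Prints the longest prefix of whole sentences that stays under 160 characters.
    print(truncate_sentences(nlp, headline, paragraphs, target_length=160))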
This strategy also has the advantage of transparency: it's obvious to users
how the decision is being made, so nobody is likely to complain about the
feature if it works this way.

Instead of cutting off the text mid-word, we can tokenize the text and
truncate on a word or sentence boundary instead, as the ``truncate_words`` and
``truncate_sentences`` baselines above do.

+----------------+-----------+
| System         | Rouge-1 R |
+================+===========+
| Truncate chars | 69.3      |
+----------------+-----------+
| Truncate words | 69.8      |
+----------------+-----------+
| Truncate sents | 48.5      |
+----------------+-----------+

Sentence Vectors
----------------

A simple bag-of-words model can be created using the ``count_by`` method,
which produces a dictionary of frequencies, keyed by string IDs:

.. code:: python

    >>> from spacy.en import English
    >>> from spacy.en.attrs import SIC
    >>> nlp = English()
    >>> tokens = nlp(u'a a a. b b b b.')
    >>> tokens.count_by(SIC)
    {41L: 4, 11L: 3, 5L: 2}
    >>> [s.count_by(SIC) for s in tokens.sentences()]
    [{11L: 3, 5L: 1}, {41L: 4, 5L: 1}]

Similar functionality is provided by `scikit-learn`_, but with a different
style of API design. With spaCy, functions generally have more limited
responsibility. The advantage of this is that spaCy's APIs are much simpler,
and it's often easier to compose functions in a more flexible way.

One particularly powerful feature of spaCy is its support for
`word embeddings`_ --- the dense vectors introduced by deep learning models,
and now commonly produced by `word2vec`_ and related systems. Once a set of
word embeddings has been installed, the vectors are available from any token:

.. code:: python

    >>> from spacy.en import English
    >>> from scipy.spatial.distance import cosine
    >>> nlp = English()
    >>> tokens = nlp(u'Apple banana Batman hero')
    >>> cosine(tokens[0].vec, tokens[1].vec)

.. _word embeddings: https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

.. _word2vec: https://code.google.com/p/word2vec/

The top-level driver reads the stop list, computes the IDF weights, selects a
feature extractor, and then summarizes each document in the corpus:

.. code:: python

    import spacy.en


    def main(db_loc, stops_loc, output_dir, feat_type='tfidf'):
        nlp = spacy.en.English()
        # Read stop list and make TF-IDF weights --- data needed for the
        # feature extraction.
        with open(stops_loc) as file_:
            stop_words = set(nlp.vocab.strings[word.strip()] for word in file_)
        idf_weights = get_idf_weights(nlp, iter_docs(db_loc))
        if feat_type == 'tfidf':
            feature_extractor = tfidf_extractor(stop_words, idf_weights)
        elif feat_type == 'vec':
            feature_extractor = vec_extractor(stop_words, idf_weights)
        else:
            raise ValueError("Unknown feat_type: %s" % feat_type)
        for i, text in enumerate(iter_docs(db_loc)):
            tokens = nlp(text)
            sentences = tokens.sentences()
            summary = summarize(sentences, feature_extractor)
            write_output(summary, output_dir, i)

.. _scikit-learn: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text

The LexRank Algorithm
---------------------

LexRank is described as a graph-based algorithm, derived from Google's
`PageRank`_. The nodes are sentences, and the edges are the similarities
between one sentence and another. The "graph" is fully-connected, and its
edges are undirected --- so it's natural to represent this as a matrix:

.. code:: python

    from scipy.spatial.distance import cosine
    import numpy


    def lexrank(sent_vectors):
        n = len(sent_vectors)
        # Build the cosine similarity matrix. scipy's cosine() returns a
        # distance, so subtract it from 1 to get a similarity.
        matrix = numpy.zeros(shape=(n, n))
        for i in range(n):
            for j in range(n):
                matrix[i, j] = 1.0 - cosine(sent_vectors[i], sent_vectors[j])
        # Normalize each row so that it sums to 1.
        for i in range(n):
            matrix[i] /= sum(matrix[i])
        return _pagerank(matrix)

The rows are normalized (i.e. each row sums to 1), allowing the PageRank
algorithm to be applied.
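To see what the normalization step does, here is a small sketch with invented
similarity scores for a three-sentence document. After dividing each row by
its sum, every row is a probability distribution, which is the form the
PageRank update expects:

.. code:: python

    import numpy

    # Invented pairwise similarity scores for three sentences.
    matrix = numpy.array([
        [1.0, 0.5, 0.2],
        [0.5, 1.0, 0.7],
        [0.2, 0.7, 1.0],
    ])
    for i in range(len(matrix)):
        matrix[i] /= sum(matrix[i])

    print(matrix.sum(axis=1))  # each row now sums to 1.0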
Unfortunately the PageRank implementation is rather opaque --- it's easier to
just read the Wikipedia page:

.. code:: python

    def _pagerank(matrix, d=0.85):
        # This is admittedly opaque --- just read the Wikipedia page.
        n = len(matrix)
        rank = numpy.ones(shape=(n,)) / n
        new_rank = numpy.zeros(shape=(n,))
        while not _has_converged(rank, new_rank):
            for i in range(n):
                new_rank[i] = ((1.0 - d) / n) + (d * sum(rank * matrix[i]))
            rank, new_rank = new_rank, rank
        return rank


    def _has_converged(x, y, epsilon=0.0001):
        return all(abs(x[i] - y[i]) < epsilon for i in range(len(x)))

Initial Processing
------------------

Feature Extraction
------------------

.. code:: python

    from spacy.en.attrs import LEMMA


    def sentence_vectors(sentences, idf_weights):
        vectors = []
        for sent in sentences:
            tf_idf = {}
            # Weight each lemma's frequency by its inverse document frequency.
            for term, freq in sent.count_by(LEMMA).items():
                tf_idf[term] = freq * idf_weights[term]
            vectors.append(tf_idf)
        return vectors

The LexRank paper models each sentence as a bag-of-words. This is simple and
fairly standard, but often gives underwhelming results. My idea is to instead
calculate vectors from `word embeddings`_, which have been one of the exciting
outcomes of the recent work on deep learning. I had a quick look at the
literature, and found a `recent workshop paper`_ that suggested the idea was
plausible.

Taking the feature representation and similarity function as parameters, the
LexRank function works as follows: given a list of N sentences, a function
that maps a sentence to a feature vector, and a function that computes a
similarity measure of two feature vectors, it produces a vector of N floats,
which indicate how well each sentence represents the document as a whole.

.. _Rouge: https://en.wikipedia.org/wiki/ROUGE_%28metric%29

.. _recent workshop paper: https://www.aclweb.org/anthology/W/W14/W14-1504.pdf

Document Model
--------------