Text Mining

2019/2020

Document summarization

Basically, we can regard the “summarization” as the “function” its input is document and output is summary.

There are mainly two ways to make the summary. Extractive and Abstractive.

Extractive: It involves the selection of phrases and sentences from the source document to make up the new summary.
Abstractive: It involves generating entirely new phrases and sentences to capture the meaning of the source document.

Extractive & Abstractive is not conflicting ways. You can use both to generate the summary. Here we cover extractive methods and, in particular, the TextRank algorithm.

TextRank

TextRank is an algorithm based on PageRank, which is often used in keyword extraction and text summarization.

PageRank is the first and best known algorithm used by Google to rank web pages in their search engine results. PageRank was named after Larry Page. PageRank is a way of measuring the importance of website pages. According to Google:

“PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.”

PageRank

PageRank is an algorithm used to calculate the weight for web pages. We can take all web pages as a big directed graph. In this graph, a node is a webpage. If webpage a has the link to web page e, it can be represented as a directed edge from a to e.

We can use a matrix to represent the inbound and outbound links among a, b, e, f in the graph.

Each node in a row means the inbound links from other nodes. For example, for the e row, node a and b have outbound links to node e.

To account for pages with many outbound links, we should normalize each column.

We use this matrix to multiply with the weight of all nodes. In the initialization, the importance of each node is \(1\). We need to run this iteration many times to get the final weights.

Back to TextRank

What is the difference between TextRank and PageTank?

The simple answer is PageRank is for webpage ranking, and TextRank is for text ranking. The webpage in PageRank is the text in TextRank, so the basic idea is the same.

Define nodes (words, sentences)
Create a graph (link nodes)
Calculate the weight for each node (PageRank)
Extract the most important nodes (summary)

Keyword Extraction

Each word is a node in PageRank
Any two-word pairs in an \(n\)-gram are considered have a bidirectional link
Based on this graph, calculate the weight for each word
The most important words can be used as keywords

Sentence Extraction

Tokenize words in each sentence
Build a Similarity matrix (e.g. cosine similarity)
Based on this graph, calculate the weight for each sentence
The most important words can be used as summary

Take Home Concepts

Document summarization is the process of shortening a set of data computationally
There are mainly two ways to make the summary. Extractive and Abstractive.
TextRank is based on PageRank
Keyword extraction with TextRank
Sentence extraction with TextRank

Document summarization

TextRank

TextRank

PageRank

Back to TextRank

Keyword Extraction

Sentence Extraction

Take Home Concepts

Take Home Concepts

References

References