Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.
Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.
Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring the uneventful and redundant frames captured.
Techniques and algorithms that naturally model summarization problems include TextRank and PageRank, submodular set functions, determinantal point processes, and maximal marginal relevance (MMR).[14]
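As a concrete illustration of one of these formulations, the following is a minimal sketch of MMR-style greedy selection over pre-computed sentence vectors; the names (`sentence_vectors`, `doc_vector`, `lambda_`) are illustrative assumptions rather than part of any standard library.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def mmr_select(sentence_vectors, doc_vector, k=3, lambda_=0.7):
    """Greedy MMR: balance relevance to the whole document against
    redundancy with the sentences already selected."""
    selected = []
    candidates = list(range(len(sentence_vectors)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(sentence_vectors[i], doc_vector)
            redundancy = max((cosine(sentence_vectors[i], sentence_vectors[j])
                              for j in selected), default=0.0)
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of the chosen sentences, in selection order
```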
In the case of research articles, many authors provide manually assigned keywords, but most text lacks pre-existing keyphrases.
They can enable document browsing by providing a short summary, improve information retrieval (if documents have keyphrases assigned, a user could search by keyphrase to produce more reliable hits than a full-text search), and be employed in generating index entries for a large text corpus.
Depending on the literature and the definition of key terms, words, or phrases, keyword extraction is a closely related topic.
Hulth showed that some improvement can be obtained by selecting candidate examples as sequences of tokens that match certain patterns of part-of-speech tags.
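A minimal sketch of that kind of candidate selection, assuming NLTK and its default part-of-speech tagger are installed; the pattern used here (optional adjectives followed by one or more nouns) is a common noun-phrase pattern and only approximates the patterns evaluated by Hulth.

```python
import nltk  # requires NLTK's tokenizer and tagger data to be downloaded

def candidate_phrases(text):
    """Collect token sequences whose POS tags match (JJ)* (NN.*)+ ."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    candidates, current = [], []
    for word, tag in tagged:
        is_adj, is_noun = tag.startswith("JJ"), tag.startswith("NN")
        has_noun = any(t.startswith("NN") for _, t in current)
        if is_noun or (is_adj and not has_noun):
            current.append((word, tag))          # extend the running phrase
        else:
            if has_noun:                         # close a completed phrase
                candidates.append(" ".join(w for w, _ in current))
            current = [(word, tag)] if is_adj else []
    if any(t.startswith("NN") for _, t in current):
        candidates.append(" ".join(w for w, _ in current))
    return candidates

print(candidate_phrases("Compatibility of systems of linear constraints is studied."))
```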
Hulth uses a reduced set of features, which were found most successful in the KEA (Keyphrase Extraction Algorithm) work derived from Turney's seminal paper.
Ensemble methods (i.e., using votes from several classifiers) have been used to produce numeric scores that can be thresholded to return a user-specified number of keyphrases.
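A minimal sketch of that kind of score combination, assuming each classifier exposes a `score(candidate)` method returning a number; the interface and names are illustrative assumptions.

```python
def rank_keyphrases(candidates, classifiers, top_n=10):
    """Average the scores of several classifiers for each candidate phrase
    and keep the top_n highest-scoring ones (a simple ensemble vote)."""
    scored = []
    for phrase in candidates:
        votes = [clf.score(phrase) for clf in classifiers]  # assumed interface
        scored.append((sum(votes) / len(votes), phrase))
    scored.sort(reverse=True)
    return [phrase for _, phrase in scored[:top_n]]
```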
The genetic algorithm optimizes parameters for these heuristics with respect to performance on training documents with known key phrases.
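A highly simplified sketch of this kind of search, reduced here to a (1+1) evolutionary loop over a single numeric parameter; in practice the search covers many heuristic parameters and the fitness is keyphrase accuracy on training documents, so the `fitness` callable below is a stand-in assumption.

```python
import random

def tune_parameter(fitness, low=0.0, high=1.0, generations=100, seed=0):
    """Toy (1+1) evolutionary search: mutate one numeric parameter and keep
    the mutation whenever the fitness (e.g. F1 on training documents with
    known keyphrases) does not decrease."""
    rng = random.Random(seed)
    param = (low + high) / 2
    best = fitness(param)
    for _ in range(generations):
        candidate = min(high, max(low, param + rng.gauss(0, 0.1 * (high - low))))
        score = fitness(candidate)
        if score >= best:
            param, best = candidate, score
    return param, best

# Toy fitness: pretend training F1 peaks when the parameter equals 0.3.
print(tune_parameter(lambda p: 1.0 - abs(p - 0.3)))
```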
While supervised methods have some nice properties, like being able to produce interpretable rules for what features characterize a keyphrase, they also require a large amount of training data.
These edges build on the notion of "text cohesion" and the idea that words that appear near each other are likely related in a meaningful way and "recommend" each other to the reader.
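A minimal sketch of building such a co-occurrence graph, assuming the networkx library; the tokenization and the window of two tokens are illustrative choices.

```python
import networkx as nx

def cooccurrence_graph(tokens, window=2):
    """Connect any two tokens that appear within `window` positions of each
    other, so that nearby words 'recommend' one another in the graph."""
    graph = nx.Graph()
    graph.add_nodes_from(tokens)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                graph.add_edge(word, other)
    return graph

tokens = "types of systems of linear constraints over sets of natural numbers".split()
ranks = nx.pagerank(cooccurrence_graph(tokens))   # TextRank-style word scores
print(sorted(ranks, key=ranks.get, reverse=True)[:3])
```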
During the DUC 2001 and 2002 evaluation workshops, TNO developed a sentence extraction system for multi-document summarization in the news domain.
Although the system exhibited good results, the researchers wanted to explore the effectiveness of a maximum entropy (ME) classifier for the meeting summarization task, as ME is known to be robust against feature dependencies.
The two methods were developed by different groups at the same time, and LexRank simply focused on summarization, but could just as easily be used for keyphrase extraction or any other NLP ranking task.
It is worth noting that TextRank was applied to summarization exactly as described here, while LexRank was used as part of a larger summarization system (MEAD) that combines the LexRank score (stationary probability) with other features like sentence position and length using a linear combination with either user-specified or automatically tuned weights.
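A minimal sketch of the shared core idea (run PageRank over a sentence-similarity graph and read off the stationary probabilities), assuming scikit-learn and networkx; real LexRank additionally thresholds the cosine similarities, and MEAD mixes the score with position and length features.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sentences):
    """Weight a fully connected sentence graph by TF-IDF cosine similarity
    and order sentences by their stationary PageRank probability."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    similarity = cosine_similarity(tfidf)       # dense n x n matrix
    graph = nx.from_numpy_array(similarity)     # node i = sentence i
    scores = nx.pagerank(graph)                 # stationary distribution
    order = sorted(scores, key=scores.get, reverse=True)
    return [sentences[i] for i in order]
```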
Multi-document summarization is an automatic procedure aimed at extraction of information from multiple texts written about the same topic.
In this way, multi-document summarization systems complement news aggregators, taking the next step down the road of coping with information overload.
Automatic summaries present information extracted from multiple sources algorithmically, without any editorial touch or subjective human intervention, thereby avoiding the bias a human summarizer might introduce.
The idea of a submodular set function has recently emerged as a powerful modeling tool for various summarization problems.[25] Moreover, the greedy algorithm is extremely simple to implement and can scale to large datasets, which is very important for summarization problems.
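A minimal sketch of the greedy procedure for a monotone submodular objective, using plain word coverage with a fixed sentence budget; the objective and budget here are illustrative simplifications, not the functions used by Lin and Bilmes.

```python
def greedy_summarize(sentences, budget=3):
    """Repeatedly add the sentence with the largest marginal gain in word
    coverage (a monotone submodular objective) until the budget is reached."""
    def words(sentence):
        return set(sentence.lower().split())

    covered, summary = set(), []
    remaining = list(sentences)
    while remaining and len(summary) < budget:
        best = max(remaining, key=lambda s: len(words(s) - covered))
        if not words(best) - covered:        # nothing new left to cover
            break
        covered |= words(best)
        summary.append(best)
        remaining.remove(best)
    return summary
```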
For example, work by Lin and Bilmes, 2012[26] shows that submodular functions achieve the best results to date on the DUC-04, DUC-05, DUC-06 and DUC-07 document summarization tasks.
Similarly, work by Lin and Bilmes, 2011,[27] shows that many existing systems for automatic summarization are instances of submodular functions.
Tschiatschek et al., 2014 show[28] that mixtures of submodular functions achieve state-of-the-art results for image collection summarization.
Extrinsic evaluations, on the other hand, have tested the impact of summarization on tasks like relevance assessment, reading comprehension, etc.
Human judgement often varies greatly in what it considers a "good" summary, so creating an automatic evaluation process is particularly difficult.[38] Domain-independent summarization techniques apply sets of general features to identify information-rich text segments.
In the following year it was surpassed by latent semantic analysis (LSA) combined with non-negative matrix factorization (NMF).
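A minimal sketch of topic-based sentence selection in that spirit, assuming scikit-learn; it factorizes a TF-IDF sentence-term matrix with NMF and keeps the highest-weighted sentence for each latent topic. The number of topics and the selection rule are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def topic_summary(sentences, n_topics=2):
    """Factorize the sentence-term TF-IDF matrix and keep the sentence
    with the largest weight in each latent topic."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    weights = NMF(n_components=n_topics, init="nndsvd",
                  random_state=0).fit_transform(tfidf)
    picked = sorted({int(np.argmax(weights[:, t])) for t in range(n_topics)})
    return [sentences[i] for i in picked]
```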
Although they did not replace other approaches and are often combined with them, by 2019 machine learning methods dominated the extractive summarization of single documents, which was considered to be nearing maturity.