Automatic Summarization: The Current State of the Art


02 Nov 2017


Artificial Intelligence's vital aim is to build machines that imitate human intelligence [2]. Automatic summarization is a highly interdisciplinary research area drawing on artificial intelligence and linguistic knowledge [7]. The practical need for automatic summarization has become increasingly urgent, and it has drawn the attention of many researchers. Automatic summarization plays an important role in handling huge amounts of text and in Information Retrieval (IR), compressing documents into meaningful data. Many articles have also applied their approaches to news articles, taking advantage of free resources such as WordNet. This article attempts to provide a comprehensive overview of research in summarization, including its two major concepts, extraction and abstraction. We also discuss the various approaches, application areas and open challenges that will shape future advances in the field.

Keywords- Abstraction, Automatic summarization, Extraction, Text mining.

I. INTRODUCTION

Automatic summarization was first introduced by Luhn some 50 years ago, with the aim of extracting key sentences to represent original documents. Existing summarization methods can be divided into two categories, as mentioned earlier: extractive summarization based on statistics, and abstractive summarization. Extractive summarization directly extracts sentences from the original documents to construct a summary; this kind of approach needs no parsing or semantic analysis, and the generated summaries are easy to understand [3]. Abstractive summarization generates summaries based on an understanding of the documents, obtained through parsing and semantic analysis; the resulting summaries are accurate but often impractical to produce.

The technology of automatic summarization is maturing and may provide a solution to the information overload problem [4]. With a large volume of documents, providing the user with a summary of each document greatly facilitates the task of finding the desired information. Summarization consists of three broad categories: text summarization, multi-document summarization and multimedia summarization. Text summarization is the process of automatically creating a compressed version of a given text that provides useful information to the user. Multi-document summarization produces a summary delivering the majority of the information content from a set of documents about an explicit or implicit main topic. Multimedia summarization deals with summarizing the text, audio and video contents of a multimedia document; it remains a limited field of research at present.

To summarize any document, the reader has to have a good understanding of the text, integrate the retrieved information, and make connections across sentences to form a sensible discourse representation [7]. In document summarization, syntax allows the construction of formal structures through which meanings can be expressed, but phrases form not only syntactic units but also semantic units. This article gives an overview of past efforts as well as summarization tasks and necessary system components.

II. KNOWLEDGE DISCOVERY

Knowledge Discovery in Databases (KDD) is the process of identifying novel and understandable patterns in data (Han and Kamber, 2000; Witten and Frank, 1999). The task of KDD is to find information or answers to questions the user already knows to ask, and also to discover the deep knowledge embedded within the data. To do this, knowledge discovery applies techniques, usually in the form of an algorithm, to find potentially important patterns in the data.

KDD and data mining have attracted huge attention due to the need to turn digital data into information and knowledge. Market analysis and business management are among the many applications that benefit from information and knowledge extracted from large amounts of data [12].

Knowledge discovery, as the name suggests, is the non-trivial extraction of information present inside the data, unknown beforehand but potentially important for users. One of the major applications of KDD is in the field of text mining, which deals with the recognition of various patterns in text documents.

III. TEXT MINING

Text mining is the process of identifying useful or interesting patterns, models, trends or directions in unstructured text; it is an application of knowledge discovery in databases. Data mining, however, focuses on the well-structured collections found in relational databases or data warehouses, whereas text mining excavates data that is far less structured.

Text mining is a challenging task because it deals with searching for accurate information or features in documents to help users discover interesting knowledge. Automatic summarization is an extension of text mining, used to produce a summary of a document in order to reduce the effort required of the user.

IV. SUMMARIZATION

The increasing availability of online information has made intensive research in the area of automatic summarization necessary within the Natural Language Processing (NLP) community [8]. We now introduce some commonly used terms in summarization: extraction is the process of identifying and retrieving important sections of the text and producing them verbatim; abstraction aims to produce the important material in a new form; fusion combines the extracted texts coherently; compression aims to throw out unimportant sections of the text. Both abstractive and extractive approaches to summarization have been attempted, depending on the situation. Abstractive summarization usually requires heavy machinery for language generation and is harder to replicate and extend to wider domains. In contrast, extraction has produced satisfactory results in large-scale applications, especially in multi-document summarization.

A summary can be defined as a text produced from one or more texts that conveys the useful, important information in the original text(s), and whose length is no longer than half the length of the original text(s) and usually significantly less [9]. This definition captures three important aspects that characterize research on automatic summarization:

• The summary should be meaningful, make sense, and point out the salient details.

• The summary should be responsive and rich in quality.

• The summary should be as short as possible.

The three broad categories of Automatic summarization have been discussed below:

A. Single Document Summarization

Single document summarization deals with one text document from which the major information is retrieved. It is the basic step towards summarization and can be performed using extraction as well as abstraction. In single document summarization, the summary revolves around one specific main topic that is consistent throughout, unlike multi-document summarization, where one may face multiple, loosely related topics. Single document summarization systems process documents one at a time.

One of the major signature themes in single document summarization is the use of discourse to identify the importance of a sentence [1]. The importance of any sentence is determined by evaluating its connection with the document theme, whether through coreferential or lexical chains. Lexical cohesion devices use the concept of repetition to find bonds between sentences [2]. These strongly bonded sentences are used to form a summary whose length is usually controlled by the bond strength threshold.
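
The repetition-based bonding described above can be sketched in a few lines of Python. This is only an illustrative toy: the stopword list, bond threshold and sample sentences are assumptions, not part of any cited system.

```python
# A minimal sketch of repetition-based bonding between sentences.
import re
from itertools import combinations

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that"}

def content_words(sentence):
    """Lowercased tokens minus a tiny stopword list."""
    return {w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS}

def bond_strength(s1, s2):
    """Number of content words the two sentences repeat."""
    return len(content_words(s1) & content_words(s2))

sentences = [
    "Automatic summarization condenses a document into a short text.",
    "A short summary helps the reader judge the document quickly.",
    "The weather today is pleasant.",
]

# Sentences bonded to at least one other sentence form the extract.
threshold = 2
bonded = set()
for i, j in combinations(range(len(sentences)), 2):
    if bond_strength(sentences[i], sentences[j]) >= threshold:
        bonded.update({i, j})

summary = [sentences[k] for k in sorted(bonded)]
```

Raising the threshold shortens the summary, which mirrors the length control by bond strength mentioned above.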

The main aim of text summarization is to present the important details of the original document in a short version that helps the user retrieve the major information quickly.

B. Multi-document Summarization

Multi-document summarization systems process more than one document simultaneously [4]. They are generally used to process multiple documents that are thematically related.

There are two types of situations in which multi-document summarization is applied: first, when there is a collection of dissimilar documents and one wants to capture the important features present across the set; second, when there is a set of documents related by a common topic, perhaps retrieved from a larger collection or a closely related topic cluster. In the first case, since the set is very large, it can be summarized by summarizing each document individually, integrating the summaries, and then summarizing this cluster again. In the second case, a summary can be constructed that contains all the main points of the topic.
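
The first strategy, summarize each document and then summarize the pooled summaries, can be sketched as below. The lead-sentence "summarizer" is a deliberately naive placeholder assumption standing in for any real single-document method.

```python
# A minimal sketch of hierarchical multi-document summarization:
# per-document summaries are pooled and then summarized again.

def summarize(sentences, n=1):
    """Naive stand-in: take the first n sentences as the summary."""
    return sentences[:n]

def summarize_collection(documents, n_per_doc=1, n_final=2):
    """Summarize each document, pool the results, summarize the pool."""
    pooled = [s for doc in documents for s in summarize(doc, n_per_doc)]
    return summarize(pooled, n_final)
```

In practice, the inner summarizer would be replaced by an extractive scorer, and the final pass would also need redundancy removal across documents.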

An ideal multi-document summary would be able to accommodate various levels of information and detail which is difficult without natural language understanding.

C. Multimedia Summarization

Under circumstances where a large number of multimedia activities happen at the same time over the Internet, it has become very important to be able to effectively sort, organise and examine these large stores of information in ways that go beyond surfing, browsing or integrated filtering.

Multimedia summarization systems analyse various information sources and combine all multimedia documents with the aim of retrieving a meaningful abstract [14]. The techniques fall into three categories:

1) Internal Technique: This technique exploits low-level features of audio, video and text.

2) External Technique: It deals with the information related to viewing activity and interaction with user.

3) Hybrid Technique: It is a combination of both internal and external information.

In multimedia summarization, it becomes very important not only to retrieve the important data but also to bring meaning to the synchronised video, audio and text information.

V. APPROACHES

We now briefly introduce the major approaches in the current state of the art of automatic summarization, covering three sub-topics: extraction, abstraction and evaluation.

A. Extraction

The major aim of automatic summarization based on extraction is sentence selection. Text extraction has had a longer history than text summarization, starting with the first study by Luhn in 1958 [6]. Traditionally, text extraction systems were built to handle huge amounts of data and used simple techniques without any symbolic or linguistic processing; the extracts produced are therefore limited in quality. The summaries serve as a quick guide to interesting information, hence their importance in everyday life. In text extraction, the highest-scoring sentences or phrases are selected from the original document and integrated into a shorter text without distorting the meaning of the source. Most current automated text summarization systems use the extraction approach to produce summaries.

Summarization by extraction involves identifying important features such as sentence length, sentence location, term frequency, the number of title words occurring in the sentence, the amount of numerical data, proper noun frequency, and so on.

In early summarization systems, summaries were produced on the basis of the most frequent words in the text.
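
A frequency-and-position sentence scorer in this early style can be sketched as follows. The feature weights (average term frequency plus a position bonus) are illustrative assumptions, not a reconstruction of any particular system.

```python
# A hedged sketch of feature-based extractive scoring: term frequency
# plus a bonus for sentences that appear early in the document.
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def score_sentences(sentences):
    """Score each sentence by the average document frequency of its
    words, with a small bonus for early position."""
    freq = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for pos, s in enumerate(sentences):
        tokens = tokenize(s)
        tf = sum(freq[w] for w in tokens) / max(len(tokens), 1)
        position_bonus = 1.0 / (pos + 1)   # earlier sentences score higher
        scores.append(tf + position_bonus)
    return scores

def extract(sentences, n=1):
    """Return the n top-scoring sentences in document order."""
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:n]
    return [sentences[i] for i in sorted(top)]
```

Real systems add the other features listed above (title words, numerical data, proper nouns) as further weighted terms in the same score.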

B. Abstraction

Extraction systems merely copy the most salient information from the text document, whereas abstraction includes paraphrasing sections of the original document. Abstraction condenses text more strongly than extraction, but it requires a strong command of natural language generation technology, which is itself still evolving.

Abstracts are produced to convey the main information in the input; they may reuse sentences, phrases or clauses from the original document, but are mostly expressed in the words of the abstract's author. The abstraction technique combines artificial intelligence and NLP.

Abstractive approaches use information extraction, ontological information, information fusion, and compression. Information extraction approaches can be characterized as top-down, since they apply a set of predefined information types to include in the summary, whereas extractive approaches are more data-driven [13]. For each topic, the user predefines frames of expected information types, together with recognition criteria. The summarization engine then locates the desired pieces of information, fills in the frames, and generates a summary from the results (DeJong 1978; Rau and Jacobs 1991). This method can produce high-quality, accurate summaries, although it applies in restricted domains only.
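
The frame-filling idea can be illustrated with a toy sketch. The slots, regular-expression recognition criteria, template and sample text below are all hypothetical assumptions chosen for illustration; real systems of this kind use far richer recognizers.

```python
# A toy sketch of information-extraction summarization: predefined
# slots with recognition patterns are filled from the text, and a
# template turns the filled frame into a one-sentence summary.
import re

# A user-defined frame for a hypothetical "product launch" topic.
FRAME = {
    "company": r"([A-Z][a-z]+ Corp)",
    "product": r'launched (?:its|the) "([^"]+)"',
    "date": r"on (\w+ \d{1,2})",
}

def fill_frame(text, frame):
    """Fill each slot with the first match of its pattern, if any."""
    return {slot: (m.group(1) if (m := re.search(pat, text)) else None)
            for slot, pat in frame.items()}

def generate(slots):
    """Generate a one-sentence summary from the filled frame."""
    return f'{slots["company"]} launched {slots["product"]} on {slots["date"]}.'

text = ('Acme Corp announced yesterday that it launched its "SmartWidget" '
        "product line on March 3, to wide acclaim.")
slots = fill_frame(text, FRAME)
```

The restriction to predefined topics is visible here: the frame only works for texts that match its patterns, which is exactly why this approach is limited to narrow domains.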

Compressive summarization results from approaching the problem from the perspective of language generation. Working with the smallest units of the original document, Witbrock and Mittal (1999) extract a set of words from the input document and then order the words into sentences using a bigram language model. Jing and McKeown (1999) have pointed out that human summaries are often constructed from the source document by a process of cutting and pasting document fragments that are then combined and regenerated as summary sentences. Therefore, an abstraction algorithm can be developed that extracts sentences, reduces them by dropping unimportant fragments, and then uses information fusion and generation to combine the remaining fragments.
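
A drastically simplified sketch of the sentence-reduction step is shown below. It is not the Jing and McKeown algorithm, which operates on parse trees; here fragments are just comma-separated spans, and the salience list is an assumed input.

```python
# A toy sketch of cut-and-paste sentence reduction: drop comma-
# separated fragments that contain no salient word.

def reduce_sentence(sentence, salient_words):
    """Keep only the fragments that mention a salient word."""
    fragments = [f.strip() for f in sentence.split(",")]
    kept = [f for f in fragments
            if any(w in f.lower() for w in salient_words)]
    return ", ".join(kept) + "."
```

The real method additionally checks grammaticality of the reduced sentence, which this comma-level toy ignores.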

However, true abstraction takes the process a step further. It involves recognising that a set of extracted passages together constitutes something new, something with a meaning that cannot be found in the source, and then replacing those passages in the summary with the new concept(s).

C. Evaluation

Evaluation deals with assessing the quality of a summary and has been the most challenging problem, since there is no single ideal summary. Even for relatively straightforward news articles, human summarizers tend to agree only approximately 60% of the time, measured by sentence content overlap [13].
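
Sentence content overlap between two human extracts can be computed as a simple set ratio; the Jaccard-style formulation and the index sets in the test are illustrative assumptions.

```python
# A minimal sketch of inter-annotator agreement as sentence overlap:
# the fraction of selected sentence indices the two extracts share.

def sentence_overlap(extract_a, extract_b):
    """Overlap = |A intersect B| / |A union B| over sentence indices."""
    a, b = set(extract_a), set(extract_b)
    return len(a & b) / len(a | b) if a | b else 1.0
```

Low overlap between competent human summarizers is precisely what makes a single gold-standard reference summary problematic.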

There have been multiple algorithms and models for evaluation that reduce the problem of accuracy, but researchers continue to look for methods that produce more acceptable models.

The largest task-oriented evaluation to date, the Summarization Evaluation Conference (SUMMAC) (Mani et al. 1998; Firmin and Chrzanowski 1999) included three tests:

• The categorization task: how well humans can categorize a summary compared to its full text.

• The ad hoc task: how well humans can determine whether a full text is relevant to a query from reading only the summary.

• The question task: how well humans can answer questions about the main thrust of the source text from reading only the summary.

However, interpreting the results is not easy, and researchers are still working on the problem of getting people to agree on the same concept, since each individual brings a different perspective.


VI. APPLICATION AREAS

These days, single-document and multi-document summarization are of great importance in every industry and field of technology. With ever more jobs and activities, individuals' lives have become packed with schedules and routines, so it is rarely possible to read any document thoroughly. Automatic summarization saves time by pointing out the important details for the user.

Forming headlines for news articles is one of the major applications of automatic summarization: it condenses an article of a given length into a single line that conveys the salient information to the reader [1].

Also, as social networking grows, summaries are useful for tracking a network, finding needed information, determining who talks to whom, and so on [1]. New forms of media such as blogs and chats, which mix text and speech separated in time, are also gaining attention; summarization has started to play a vital role here, but faces its own problems with slang and informal language, which have become an attraction for summarization researchers. Automatic summarization has also helped build online search engines that retrieve data for a particular user query.

Multimedia summarization is nowadays also being applied to allow judicial actors to browse and navigate multimedia documents related to penal hearings and proceedings [14].

VII. CHALLENGES

The field of automatic summarization is very vast, and though many papers have been published by researchers around the globe, it remains open to various challenges. Since the domain of Artificial Intelligence (AI) deals with mimicking human intelligence, it is practically impossible to achieve perfect accuracy in summarization algorithms. Both extraction and abstraction are applied, but the extraction approach still dominates the field [1]. Extraction can handle the syntactic details of a document well, but semantic details are less well covered. Thus, the major challenge in building a summary is not grammar or syntax but, most importantly, bringing meaning into the summary in a way that makes it useful for the user.

Apart from single-document and multi-document summarization, it is necessary to consider the needs of the user and what exactly he or she is asking for. This kind of summarization is known as query-focused summarization, and it is carried out in both IR and Web-based question-answering contexts.

In recent work, researchers have looked at how to generate sentences that convey the integration of information found across different sentences, which is helpful in question answering. This area demands new approaches and is therefore very important in the domain of summarization.

One aspect to consider during automatic summarization is that a good summary not only produces the key points but also arranges the details by their level of importance, which makes it more understandable and reflects its quality.

Another challenge is summarization on social networking sites, with their informal language and, in the case of chat, many abbreviations that appeal to the new generation but differ greatly from the kind of language researchers usually handle. Summarization in this area has only just started to appear, and more research is in progress.

The new summary formed by the abstraction approach sometimes contains a concept that is not explicitly present in the document [13]. This means that the abstraction system must have access to external information of some kind, such as an ontology or knowledge base, and should be able to perform combinatorial inference. Because of the lack of resources supporting this kind of reasoning, abstractive summarization has not progressed far.

VIII. CONCLUSION

There is an immense and ever-growing amount of information available on the web [3]. When the user enters a query, search engines produce a large number of documents as output and are therefore unable to provide precise information; it becomes difficult for the user to find exactly what was asked for, and no one today has the patience or time to read every document. Hence, automatic summarization has become increasingly desirable.

There have been many advances in this field since the 1960s. New methods for salience extraction have emerged, including graph-based methods, which find important parts through links between words and sentences and then exploit the relationships between sentences.
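
A graph-based salience method in the style of these approaches can be sketched as a PageRank-style power iteration over a sentence-similarity graph. The overlap-count edge weights, damping factor and toy sentences are illustrative assumptions.

```python
# A compact sketch of graph-based salience: sentences are nodes,
# edges are weighted by word overlap, and a PageRank-like power
# iteration scores each sentence's centrality.
import re

def words(s):
    return set(re.findall(r"[a-z]+", s.lower()))

def rank_sentences(sentences, damping=0.85, iters=50):
    n = len(sentences)
    # Edge weight: number of words two sentences share.
    w = [[len(words(a) & words(b)) if i != j else 0
          for j, b in enumerate(sentences)]
         for i, a in enumerate(sentences)]
    out = [sum(row) or 1 for row in w]      # avoid division by zero
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n +
                  damping * sum(scores[j] * w[j][i] / out[j]
                                for j in range(n))
                  for i in range(n)]
    return scores
```

The top-scoring sentences then form the extract; sentences connected to many others accumulate the most score, which is the intuition behind link-based salience.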

Many researchers these days make use of free resources such as WordNet, which has proven to be a very powerful tool in summarization. WordNet is an electronic lexical database developed at Princeton University (Miller et al., 1993) [4]. Words of the same syntactic category that can be used to express a given concept are grouped into the same sets.

Summarization research in specific domains often makes use of more sophisticated resources and often has access to semantic resources. Though such systems may use sentence extraction, their methods for finding key points are quite different. A substantial amount of further research is needed on semantic and discourse knowledge, since these areas have great potential for advancing automatic summarization techniques.


