A Query Based Arabic Text Summarization


Abstract—With the growth of web resources and the huge amount of information available, the need for automatic summarization systems has appeared. Summarization is needed most when searching for information on the web, where the user targets a certain domain of interest through a query; in this case domain-based summaries serve best. Despite the existence of plenty of research work on domain-based summarization in English, there is a lack of such work in Arabic due to the shortage of existing knowledge bases. In this paper we introduce a query-based Arabic text summarization approach that uses an existing Arabic language thesaurus and an extracted knowledge base. We use an Arabic corpus to extract domain knowledge represented by topic-related concepts/keywords and the lexical relations among them. The user's query is expanded once using only the Arabic WordNet thesaurus and then again by adding the domain-specific knowledge base to the expansion. For the summarization dataset, the Essex Arabic Summaries Corpus was used; it contains many topic-based articles with multiple human summaries. Performance was better when using our extracted knowledge base than when using WordNet alone.

Keywords— Arabic text summarization, Knowledge-based summarization, Query expansion, Ontology extraction from text, Arabic WordNet.

Introduction

Due to the increased amount of data on the web, it has become harder to understand a certain topic without the effort of reading long documents and going through many web pages to determine the most relevant ones. Hence the need for automatic systems that save the user's time, such as document clustering software, automatic summarizers, data mining software, etc. A summary can be defined as a text that is produced from one or more texts, that conveys the important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that. The main goal of a summary is to present the main ideas of a document in less space [1].

When an online search engine is used, the results are expected to belong to a certain domain specified by the given query. Thus a system that summarizes documents based on their specific domain, determined by a query, can be of great help. Much work has been done in the field of knowledge-based summarization; some systems are multi-document summarizers and others are single-document ones. Some approaches use encyclopedic knowledge from Wikipedia, where the expanded query is linked with its associated documents through spreading activation in a graph that represents words and their grammatical connections in those documents [1]. Hovy and Lin (1999) used the lexical thesaurus WordNet to generalize concepts and thus to identify the topics of the text in the summarization system SUMMARIST [2]. Others use an existing ontology as their reference knowledge base, such as [5]. There is a noticeable lack of knowledge-based Arabic summarizers, due to the lack of existing knowledge bases.

In this paper we adapted a query-based summarization approach that was originally aimed at the medical field but showed potential for use in other information retrieval areas. The approach expands the user query using an existing medical ontology knowledge source, the user is asked to finalize the expanded query, and then the summary is produced by comparing each sentence in the original document with the original and the expanded query, scoring each sentence, and finally including the ones with the highest scores in the summary [5].

Our system deals with Arabic news articles, and due to the general lack of Arabic ontologies, we used the Arabic WordNet (AWN) to provide the words' lexical synonyms for expanding the user's query. We also developed our own knowledge base consisting of the domain ontology concepts and the relations among them. The user chooses the desired domain and the corpus is collected from the World Wide Web, the knowledge base is built from that corpus, the user's query is expanded, and finally the summary is produced.

Review

Before describing how the system works, some main aspects should be explained.

2.1. Ontology

An ontology is an explicit, formal specification of a shared conceptualization of a domain of interest. It should be restricted to a given domain of interest and therefore model concepts and relations that are relevant to a particular task or application domain [6]. Ontologies provide a richer knowledge representation that improves machine interpretation of data [7].

Manual acquisition of ontologies is a tedious and cumbersome task. It requires extended knowledge of a domain, and in most cases the result can be incomplete or inaccurate. Manually built ontologies are expensive, error-prone, and biased towards their developer. Researchers try to overcome these disadvantages by using semi-automatic or automatic methods for building the ontology. Automation of ontology construction not only reduces costs, but also results in an ontology that better matches its application. During the last decade, several ontology learning approaches and systems have been proposed. They build ontologies in two ways. One way is to develop tools that are used by knowledge engineers or domain experts to build the ontology, such as Protégé and Jena; these are called ontology modeling tools. The other way is the semi-automatic or automatic building of ontologies by learning them from different information sources [8]. The following sub-sections discuss ontology learning from text and the Arabic WordNet.

2.2. Ontology learning from text

Ontology learning refers to extracting ontological elements (conceptual knowledge) from input and building an ontology from them. It aims at semi-automatically or automatically building ontologies from a given text corpus with limited human effort. The ontology can be built from scratch (automatically), or by adapting an existing ontology in a semi-automatic fashion using several sources [8].

Text, or unstructured data, is the most difficult type to learn from; it needs more processing than semi-structured or structured data. The systems that have been proposed for learning from free text often consist of the following four main processes, although they differ in the methodology of each process:

NLP: An ontology extractor from text must perform some NLP processes on the corpus to be able to extract knowledge from it. As a matter of fact, some pre-processing should be applied to the texts before the NLP steps, such as removing abbreviations, numbers, words that don't belong to the ontology language, diacritics (تشكيل) in the case of Arabic, etc. NLP processes include POS tagging, parsing (shallow or dependency), NER (Named Entity Recognition), stop word removal, and stemming or lemmatization.

Concept extraction: Concept or keyword extraction can be described as the task of identifying a small set of words, key phrases, keywords, or key segments in a document that can describe its meaning [9]. In [9], the existing methods for keyword extraction were divided into four categories: (a) simple statistics, such as term frequency; (b) linguistic analysis, such as POS tagging, mostly combined with statistical measures; (c) machine learning, where a set of training documents with human-chosen keywords is provided to the system and the learned model is then applied to find keywords in new documents; and (d) mixed approaches that combine the above methods or use heuristic knowledge such as the position, length, or layout features of the words. These approaches can be applied to both single-word and multi-word concept extraction.

Relation extraction: Besides having a list of concepts, ontologies should define the relations among those concepts. Several researchers have attempted to find taxonomic relations expressed explicitly in texts by matching certain patterns, referred to as Hearst patterns. Other researchers have used the internal structure of noun phrases to find taxonomic relations [10]. Some relations, such as "is-a" or "part-of", are determined through dependency parsers.

Ontology or hierarchy building: Some ontology extraction tools need a reference ontology to update with the new concepts and relations they deduce. Other tools use association rules over concepts and relations to remove redundancies in the produced ontology hierarchy. Finally, some tools use Formal Concept Analysis (FCA), a principled way of deriving a concept hierarchy or formal ontology from a collection of objects and their properties [11].

There are some available tools that extract ontologies from text, such as Text-To-Onto and its successor Text2Onto, OntoLearn, the Protégé plugin OntoLT, TERMINAE, and others developed by researchers, such as CRCTOL and the automatic construction of ontology from Arabic texts. It is worth mentioning that none of them supports the Arabic language except for the last one, which tries to build an ontology for the whole Arabic language and also requires an existing ontology to update with the rules it extracts.

2.3. AWN (Arabic WordNet)

WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. Though WordNet contains a sufficiently wide range of common words, it does not cover special domain vocabulary, since it is primarily designed to act as an underlying database for different applications [12]. The AWN follows the methodology of the EuroWordNet [13].

Proposed System Description

The system contains two main components: the first is responsible for knowledge construction, and the second performs the summarization. Figure 1 shows the proposed system architecture.

[Figure 1. Proposed system architecture. Knowledge construction: corpus pre-processing & NLP, concept extraction, and relation extraction. Summarization: the user's query is expanded via AWN and via the knowledge base, and the summary is then produced.]

3.1. Knowledge construction

Corpus pre-processing and NLP

Arabic articles on different topics were collected from the internet; articles from a given domain are grouped together so they can be fed to the system, which gives the user the ability to build knowledge for any desired domain. Some pre-processing is applied to the corpus before knowledge extraction. Any non-Arabic words or letters, numbers, diacritics (تشكيل), symbols or non-letters such as brackets or quotation marks, extra spaces, and empty lines are removed. Stop words are not removed because they are used in the concept and relation extraction.
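A minimal sketch of this cleaning step is shown below, assuming a regular-expression implementation; the Unicode ranges and function name are illustrative rather than the exact code used:

```python
# Minimal pre-processing sketch (assumed implementation): strip diacritics,
# drop anything that is not an Arabic letter or whitespace, and collapse
# extra spaces. Stop words are intentionally kept for later extraction steps.
import re

ARABIC_DIACRITICS = re.compile(r'[\u064B-\u0652\u0670]')   # tashkeel marks
NON_ARABIC        = re.compile(r'[^\u0621-\u064A\s]')      # not an Arabic letter
EXTRA_SPACE       = re.compile(r'\s+')

def clean_arabic(text: str) -> str:
    text = ARABIC_DIACRITICS.sub('', text)   # remove تشكيل
    text = NON_ARABIC.sub(' ', text)         # drop digits, Latin letters, symbols
    return EXTRA_SPACE.sub(' ', text).strip()

print(clean_arabic("هذا مثالٌ 123 (test) بسيط"))  # -> "هذا مثال بسيط"
```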

The Stanford POS tagger is then used to determine the type of each word, i.e. noun, verb, etc. It is a Java implementation of the log-linear part-of-speech taggers described in [14].

Concept extraction

In our system we extracted multi-word concepts, which are composed of words that co-occur together more often than would be expected by chance. We chose this because, according to studies such as [15], most domain-specific concepts are multi-word terms; the small number of relevant single-word terms can either be found appearing frequently in the multi-word terms or easily inferred from them. Also, single-word concepts may include general concepts as well as domain ones, and relation extraction for them would not be easy, especially in Arabic, given the lack of dependency parsers and of human intervention.

There are many approaches to extracting multi-word keywords; a survey can be found in [16]. Briefly, some use statistical approaches and then apply frames to the results to exclude unwanted patterns, such as [18] and [19], while other approaches apply parsing or use a POS tagger to choose certain patterns, such as [15] and Text-To-Onto. All approaches then calculate the frequency of each pattern in the corpus; some stop at that step, and others extend the calculations to enhance performance, such as the domain relevant measure (DRM) and terminology identification measure (TIM) in [15], or the C-value/NC-value. In this paper we used the C-/NC-value measures because in previous research they showed better precision and recall than relying on frequencies alone [16]. The C-/NC-value method is an efficient, domain-independent multi-word term recognition method that combines linguistic and statistical knowledge [17].

The C-value/NC-value algorithm [20] proceeds as follows (a small computational sketch is given after the steps):

Tag the corpus.

Choose candidate concepts matching certain patterns: Noun+Noun, (Adj|Noun)+Noun, (Adj|Noun)+(Adj|Noun)*, possibly with a preposition among them. The algorithm does not define a maximum number of words per concept, but for Arabic it is reasonable not to consider more than four words: extracting multi-word concepts of more than three words becomes problematic [19], and [18] also extracted at most four.

Apply a stop list, i.e. a list of words that are not expected to occur as keywords in the domain even though they may appear frequently. In the original approach this list was produced by studying the domain texts; it is not needed in query-based summarization because the user's query eliminates undesired concepts.

Calculate the total frequency of the candidate string in the corpus.

Calculate the frequency of the candidate string as part of other, longer candidate terms.

Calculate the number of these longer candidate terms.

Calculate the length of the candidate string (in number of words).

Calculate the C-value.
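As an illustration, the following is a minimal sketch of the C-value computation on top of the quantities listed above; the data structures (`freq`, `nested_in`) are assumptions about how the counts are stored, not the actual implementation:

```python
# Minimal sketch of the C-value computation [20]. `freq` maps each candidate
# term (a tuple of words, assumed multi-word) to its corpus frequency;
# `nested_in` maps a term to the set of longer candidate terms containing it.
from math import log2

def c_value(term, freq, nested_in):
    longer = nested_in.get(term, set())
    if not longer:                        # term never nested in a longer candidate
        return log2(len(term)) * freq[term]
    nested_freq = sum(freq[t] for t in longer)
    return log2(len(term)) * (freq[term] - nested_freq / len(longer))

# Example: a two-word term seen 12 times, 8 of them inside one longer term.
freq = {("محرك", "بحث"): 12, ("محرك", "بحث", "عربي"): 8}
nested_in = {("محرك", "بحث"): {("محرك", "بحث", "عربي")}}
print(c_value(("محرك", "بحث"), freq, nested_in))   # log2(2) * (12 - 8/1) = 4.0
```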

Relation extraction

Techniques for taxonomic relation extraction vary from using a dependency parser to determine the relation between concepts to statistical approaches. Some tools use dependency parsers to determine whether one concept is the subject of another, which indicates a relation, while other tools use shallow parsers or POS taggers and consider the pattern (Noun1, Verb, Noun2) [15], relying on the hypothesis that verbs indicate semantic relations between concepts. Other tools consider "is-a" and "has-a" relations, such as [18], and some use a mixture of these. All of these tools also detect the frequency with which concepts appear together. Text2Onto [21] developed JAPE patterns for both shallow parsing and the identification of concepts and different types of relations. JAPE rules have to be developed by humans who are aware of the domain, and the rules are processed using the GATE NLP toolkit. Text2Onto supports only the English language, for which JAPE rules are easier to create.

Due to the lack of Arabic dependency parsers, in this paper we used "is-a" relations indicated by phrases such as "هو, هي, هما, ..." and "has-a" relations indicated by phrases such as "تتألف من, تتكون من, تنقسم إلى,..."; this list can be extended if new phrases are discovered. If any two concepts are mentioned together with high frequency, this is also considered a relation. Finally, we adopted the (Concept1, Verb, Concept2) approach to determine relations using the Stanford POS tagger.
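A rough sketch of this pattern-based extraction follows; the marker lists come from the phrases above, while the helper functions and the assumption that verb tags begin with "VB" (as in Penn-style tagsets) are illustrative rather than the exact implementation:

```python
# Sketch of the pattern-based relation extraction (assumed details).
IS_A  = {"هو", "هي", "هما"}
HAS_A = {"تتكون", "تتألف", "تنقسم"}        # first words of the "has-a" phrases

def concept_starting_at(words, pos, concepts):
    """Return a concept whose words begin at position `pos`, if any."""
    for c in concepts:
        parts = c.split()
        if words[pos:pos + len(parts)] == parts:
            return c
    return None

def nearest_concept(words, concepts, start, step):
    i = start + step
    while 0 <= i < len(words):
        c = concept_starting_at(words, i, concepts)
        if c:
            return c
        i += step
    return None

def extract_relations(sentence, concepts):
    """`sentence` is a list of (word, tag) pairs from the POS tagger;
    `concepts` is the set of multi-word concepts found by the C-value step."""
    words = [w for w, _ in sentence]
    relations = []
    for i, (w, tag) in enumerate(sentence):
        if w in IS_A:
            rel = "is-a"
        elif w in HAS_A:
            rel = "has-a"
        elif tag.startswith("VB"):
            rel = w                        # the verb itself labels the relation
        else:
            continue
        left  = nearest_concept(words, concepts, i, -1)
        right = nearest_concept(words, concepts, i, +1)
        if left and right:
            relations.append((left, rel, right))
    return relations
```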

In our approach we did not build the ontology hierarchy, because it is out of the scope of the summarization task and the hierarchy would not add value to the already built knowledge of concepts and relations.

3.2. Summarization

Corpus collection and pre-processing

As mentioned before, several articles about different domains are collected from the internet. When the user enters a query, the corpus is searched for all articles related to it. The articles are then pre-processed in the same way explained in the previous section.

AWN query expansion

The user's query is expanded by running it against the Arabic WordNet. AWN does not provide a Java API like the EuroWordNet, so we had to use the AWN database and access the source code to retrieve all the synsets. If the user's query contains any stop words, symbols, non-letters, or non-words, they are removed before expansion. The user is then asked to validate the expansion results and is given the chance to remove any words that seem irrelevant.
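Since AWN is accessed directly through its database, the expansion can be sketched roughly as below; the table and column names (`items`, `word`, `synset_id`) are purely hypothetical and only illustrate the idea of collecting all words that share a synset with a query word:

```python
# Illustrative sketch of query expansion against the AWN database.
# The schema shown here is an assumption, not the real AWN layout.
import sqlite3

def expand_with_awn(query_words, db_path="awn.db"):
    conn = sqlite3.connect(db_path)
    expanded = set(query_words)
    for word in query_words:
        synsets = conn.execute(
            "SELECT synset_id FROM items WHERE word = ?", (word,)).fetchall()
        for (sid,) in synsets:
            rows = conn.execute(
                "SELECT word FROM items WHERE synset_id = ?", (sid,)).fetchall()
            expanded.update(w for (w,) in rows)   # add every synonym in the synset
    conn.close()
    return expanded
```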

Knowledge base query expansion

The original query is also expanded against the knowledge base of concepts and relations. All the concepts related to every word in the query, and their relations, are added. This expanded query is also finalized by the user. The finalized and validated query is then used in the summarization.
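A comparable sketch of the knowledge-base expansion, assuming the concepts and (Concept1, relation, Concept2) triples extracted earlier are available in memory (names are illustrative):

```python
# Sketch of knowledge-base query expansion: for each query word, add every
# concept that contains it and every concept related to those concepts.
def expand_with_kb(query_words, concepts, relations):
    expanded = set(query_words)
    for w in query_words:
        hits = {c for c in concepts if w in c.split()}
        expanded.update(hits)
        for c1, _, c2 in relations:
            if c1 in hits:
                expanded.add(c2)
            elif c2 in hits:
                expanded.add(c1)
    return expanded
```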

Producing summary

As in the adapted approach [5], every article is summarized as follows: each sentence in the corpus is given a score by comparing every word in it with the original and the expanded query. If the word exists in the original query it is given a score of 1, and if it is in the expanded query it is given a weight of 0.5. The score of the whole sentence is the sum of the word scores. The list of sentences for each article is then sorted by score. The user is asked for the preferred summary length, at most 50% of the original document, though a smaller proportion can be chosen. The sentences with the highest scores are then displayed to the user in the same order in which they appeared in the original document.
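A minimal sketch of this scoring and selection step, under the assumption that sentences are already pre-processed and the queries are given as sets of words:

```python
# Words in the original query score 1.0, words only in the expanded query 0.5.
def score_sentence(sentence_words, original_query, expanded_query):
    score = 0.0
    for w in sentence_words:
        if w in original_query:
            score += 1.0
        elif w in expanded_query:
            score += 0.5
    return score

def summarize(sentences, original_query, expanded_query, ratio=0.5):
    """`sentences` is the list of pre-processed sentences in document order."""
    scored = [(score_sentence(s.split(), original_query, expanded_query), i, s)
              for i, s in enumerate(sentences)]
    n_keep = max(1, int(len(sentences) * ratio))
    best = sorted(scored, reverse=True)[:n_keep]   # highest-scoring sentences
    best.sort(key=lambda t: t[1])                  # restore document order
    return [s for _, _, s in best]
```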

Evaluation and Results

Evaluation of summarization is quite a hard problem. Often, a lot of manual labor is required, for instance by having humans read generated summaries and grade their quality with regard to different aspects such as information content and text clarity. Manual labor is time consuming and expensive. Summarization is also subjective: the conception of what constitutes a good summary varies a lot between individuals, and of course also depends on the purpose of the summary [28].

Some automatic text summarization tools use DUC (Document Understanding Conference) datasets to test their algorithms and some of them use human evaluation such as [22], while others use the abstract of an article as the human summary.

Due to the lack of Arabic datasets or suitable Arabic papers with abstracts, the approach in [23] translated DUC datasets using Google Translate. In our case we used the EASC, an Arabic natural language resource. It contains 153 Arabic articles and 765 human-generated extractive summaries of those articles. These summaries were generated using Mechanical Turk [24], a subsidiary of Amazon.com that provides a web services system that uses people to perform tasks better handled by humans than computers [25]. Among the major features of EASC is that file names and extensions are formatted to be compatible with current evaluation systems such as ROUGE and AutoSummENG [24]. This dataset also suits our domain knowledge-base summarization because the articles are divided into topics, i.e. art & music, education, environment, finance, health, politics, religion, science & technology, sports, and tourism.

We compared our WordNet query expansion summaries with the human summaries once and then again after adding the knowledge-based query expansion. For the knowledge base building, we collected more articles about the required domain, from the World Wide Web, to create a richer list of concepts and relations.

For the evaluation we used the latest version of ROUGE 1.5, available at [26]. ROUGE is based on n-gram co-occurrence between machine summaries and human summaries and is a widely accepted standard for the evaluation of summarization tasks [27]. The ROUGE evaluation was performed with N-gram 1:1 and a confidence interval of 95%, which is believed to give results close to those of human evaluation [28].
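For reference, ROUGE-N (with N = 1 here) measures the n-gram recall of a system summary against the reference summaries; its standard definition is:

\[
\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \mathrm{References}} \sum_{\mathrm{gram}_N \in S} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_N)}{\sum_{S \in \mathrm{References}} \sum_{\mathrm{gram}_N \in S} \mathrm{Count}(\mathrm{gram}_N)}
\]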

Table 1. Best results

                             Avg_R     Avg_P     Avg_F
WordNet query expansion      0.20510   0.02885   0.04291
WordNet & knowledge base     0.29400   0.10981   0.12411

Table 1 shows the results of comparing each approach with the human summaries given in EASC. Avg_R is the average recall, Avg_P is the average precision, and Avg_F is the average F-measure, the harmonic mean of precision and recall [29]. Recall is the fraction of relevant instances that are retrieved, while precision is the fraction of retrieved instances that are relevant [30]. As we can see, using the knowledge base query expansion enhances the performance.
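Written out, with the system summary playing the role of the retrieved set and the reference summary the relevant set, these measures are:

\[
P = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|}, \qquad
R = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|}, \qquad
F = \frac{2PR}{P + R}
\]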

Conclusion and Future Work

Our system is a query-based single-document summarizer; it expands the user's query using AWN. We also showed how adding a knowledge base to the query expansion increases the performance of the system, which was confirmed by the ROUGE evaluation results. To extract the knowledge, we used statistical and linguistic measures on an existing domain corpus, obtaining a list of the domain's main concepts and the relations among them. We used the EASC dataset for the evaluation, which consists of Arabic articles grouped into certain domains, each article having multiple human-generated summaries.

The possible future work for this paper includes:

Building the ontology hierarchy, which might, using association rule algorithms, detect new concept relations that cannot be extracted explicitly from the corpus.

Finding a similar approach to compare our results with, or translating data from existing related approaches into Arabic, to allow a wider range of comparison.

Expanding our work to multi-document summarization.


