Test Model For Summarizing Hindi Text


Chetana Thaokar [1], M.Tech. Student
Dr. Latesh Malik [2], Professor & HOD
Department of Computer Science & Engg., G. H. Raisoni College of Engineering, Nagpur
[email protected]

Abstract

The amount of information available on the web is roughly doubling every day, leading to information overload, and finding important and useful information is becoming difficult. Automatic summary generation addresses this by producing shortened information from documents written on the same topic. Such systems are among the most active and attractive research areas: they extract the main points of a text so that the user can spend less time reading the whole document. This paper discusses an approach to summarizing Hindi text documents using the sentence extraction method. It uses Hindi WordNet to tag each word with the appropriate part of speech (POS) for checking the SOV structure of a sentence, and it uses a genetic algorithm to optimize the generated summary based on text feature terms so that it covers the maximum theme with minimum redundancy. The proposed work is under development.

Keywords: Text Features, Hindi Wordnet, Extraction Method, Genetic Algorithm, Sentence Rank

1. INTRODUCTION

Text summarization has become an important tool for analyzing and interpreting text documents in a fast-growing information world. According to Lin and Hovy [1], "a summary can be defined as a text that is produced from one or more texts, that contains a significant portion of the information in the original text, and that is no longer than half of the original text." Mani [2] describes text summarization as "the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks)." The generation of a summary by computer differs from a human-generated summary: a human can understand and capture the depth of a document on the basis of syntax and semantics, whereas automating this requires the machine to understand the depth of the language, which is a complex task. Summarization is useful in everyday life, for example in news headlines, the abstract of a technical paper, the review of a book or the preview of a movie.

Automatic summarization can be categorized as extractive or abstractive. An extractive summary consists of sentences, word sequences, phrases and paragraphs contained in the original documents, concatenated in a shorter form according to their ranks based on feature values. Unfortunately, extracts suffer from inconsistency and lack of balance, and the extracted sentences tend to be lengthy. An abstractive summary, in contrast, usually requires linguistic data for interpreting and analyzing the text and depends heavily on the computational power of Natural Language Processing; its biggest challenge is the representation of natural language by the system. Automatic summary generation has the advantages that the size of the summary can be controlled and that the content is deterministic, since original sentences are retained in the summary.

As the Internet grows, the proportion of people using it is increasing regardless of language. Hindi is the official language of India and is also spoken in Fiji, Suriname, Mauritius and Nepal; there are 600 to 700 million Hindi speakers worldwide. Automatic summarization will therefore be an important tool for people who do not know English but want to read articles on the Internet precisely, and it will also be a help to data analysts. Hindi is written in the Devanagari script, which has a large alphabet.

2. OVERVIEW OF EXTRACTION-BASED SUMMARIZATION TECHNIQUES

There are various techniques for generating an extractive summary by finding the relevant sentences to be added to the summary. They can be classified as statistical, linguistic and hybrid approaches.

2.1 Statistical Method

Text summarization based on this approach relies on the statistical distribution of certain features and is done without understanding the whole document. It uses classification and information retrieval techniques. Classification methods classify the sentences that can be part of the summary depending on the training data. Information retrieval techniques use position, sentence length or word occurrences in the document. This method extracts sentences that occur in the source text without taking the semantics of the words into consideration [1].

2.2 Linguistic Method

This method requires deep linguistic knowledge so that the computer can analyze the sentences and decide which of them to select. It identifies term relationships in the document through part-of-speech tagging, grammar analysis and thesaurus usage, and extracts meaningful sentences. Parameters can be cue words, the title feature, or the nouns and verbs in the sentences [3]. Statistical approaches may be efficient in computation, but linguistic approaches look into term semantics, which may yield better summary results. In practice, linguistic approaches also adopt simple statistical computation, such as the term frequency-inverse document frequency (TF-IDF) weighting scheme, to filter terms. However, relatively few studies in the literature discuss linguistic approaches that adopt a term weighting scheme derived from a formal mathematical (probabilistic) model to make weight determination more principled.

2.3 Hybrid Method

It exploits the best of both the previous methods to produce a meaningful and short summary [4].

3. RELATED RESEARCH OVERVIEW

The majority of early research focused on the development of simple surface-level techniques that tend to indicate important passages in the original text. Most systems use the sentence as the unit. Early techniques for sentence extraction computed a score for each sentence based on features.

Table 1 below describes some of the previous summarization work, based on the text features and the techniques used for sentence scoring.

Name & Year | Features for Extraction
Luhn, 1958 [5] | Word and phrase frequency
Edmundson, 1969 [5] | Word frequency, cue phrases, title/heading words and sentence location
ANES, 1995 [5] | Term and sentence weighting (tf*idf)
Barzilay & Elhadad, 1997 [5] | Topic identification of the text by grouping words into lexical chains
SUMMARIST, 1998 [1] | Topic identification, interpretation and generation
Cut & Paste, 2001 [6] | Sentence reduction and sentence combination techniques; key sentences are identified by a sentence extraction algorithm that covers lexical coherence, tf*idf score, cue phrases and sentence positions
K.U. Leuven, 2003 [5][7] | Topic segmentation, sentence scoring based on weight and position
LAKE, 2004 [5][7] | Keyphrase extraction based on supervised learning and linguistic features such as named entity recognition or multiwords
Arman Kiani, 2006 [8] | Title and thematic words using genetic and fuzzy systems for sentence extraction
NetSum, 2007 [9] | Machine learning using a neural network algorithm
Khosravi & Dehkordy, 2008 [9] | Sentence length, sentence title, keywords using fuzzy systems
Suanmali, Salim, Binwahlan, 2011 [10] | Title feature, length, proper noun, position and term weight; GA for sentence scoring

Table 1. Related Summarization Work

4. PROPOSED SYSTEM

The goal of extractive text summarization is to select the most relevant sentences of the text. The proposed method uses statistical and linguistic approaches to find the most relevant sentences. As shown in Fig. 1, the summarization system consists of three major steps: preprocessing, extraction of feature terms, and a genetic algorithm that ranks the sentences based on the optimized feature weights.

Fig. 1. Proposed Scheme (Hindi text document → preprocessing step → extraction of words and sentence features → SOV qualification → genetic algorithm → sentence ranking → summary)

4.1 Preprocessing step

Preprocessing involves preparing the text document for analysis. This step comprises sentence segmentation, tokenization, stop word removal and stemming.

4.1.1 Sentence Segmentation

This is the process of decomposing the given text document into its constituent sentences, along with their word counts. In Hindi, a sentence is segmented by identifying its boundary, which is marked by the purna viram (।).

4.1.2 Tokenization

This is the process of splitting the sentences into words by identifying the spaces, commas and special symbols between the words. Lists of sentences and words are maintained for further processing.

4.1.3 Stop word Removal

To use the word feature score effectively, we need to consider only the words in the document that are important. Common words that carry no semantics and do not contribute relevant information to the task are removed. Stop words are common words that carry less meaning than keywords; they should be eliminated, otherwise sentences containing them can unduly influence the generated summary. In this experiment, 185 words are used as stop words. A list of stop words is given in Table 2.

पर  उन्हों  बिलकुल  निहायत  ऱ्वासा  इन्हीं  उन्हीं  उन्हें
इसमें  जितना  दुसरा  कितना  दबारा  साबुत  वग़ैरह  दूसरे
कौनसा  थी  इन  वह  यिह  वुह  जिन्हें  जिन्हों
तिन्हें  तिन्हों  किन्हों  किन्हें  इत्यादि  द्वारा  इन्हें  इन्हों
सकते  इसके  सबसे  होने  दबारा  साबुत  दूसरे  कौनसा
लेकिन  होता  करने  किया  लिये  अपने  नहीं  दिया
इसका  करना  वाले  हौ  था  है

Table 2. Stopword List

Fig. 2 shows the result of sentence segmentation, tokenization and stop word removal, with word-count details for every sentence.

Fig. 2. Preprocessing phase

4.1.4 Stemming

Syntactically similar words, such as plurals and verbal variations, are considered equivalent; the purpose of this procedure is to obtain the stem or root of each word, which emphasizes its semantics. For example, the root word of "foxes" is "fox".

Stemming is used to match the words of sentences when checking the similarity features. The stemmer used was developed by IIT Mumbai.

4.2 Feature Extraction

The real analysis of the document for summarization begins in this phase. Every sentence is represented by a vector of feature terms, which characterize the sentence statistically and linguistically. Each sentence receives a score based on the weights of its feature terms, which in turn is used for sentence ranking. Feature term values range between 0 and 1.

The following subsections describe the features used in this study.

F1: Average TF-ISF (Term Frequency-Inverse Sentence Frequency)

TF evaluates the distribution of each word over the document. Inverse sentence frequency reflects that terms occurring in only a few sentences are more important than terms occurring in many sentences of the document. In other words, it matters in how many sentences a certain word exists: a word that is common in a sentence but also common in most other sentences is less useful for differentiating that sentence from the others. Let SF be the sentence frequency, i.e. the number of sentences in which the word occurs in a document of N sentences. The feature is calculated as

tf·isf = TF × ISF (1)

The average tf·isf over the words of each sentence is assigned as the weight of that sentence.
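As an illustrative sketch (not part of the paper's implementation), the average TF-ISF weight of each sentence could be computed as below. It assumes the common formulation ISF = log(N / SF); function and variable names are ours.

```python
import math

def average_tf_isf(sentences):
    """Average TF-ISF weight per sentence.

    `sentences` is a list of token lists (stop words removed, stemmed).
    Assumes ISF = log(N / SF), a common formulation not spelled out in the text.
    """
    n = len(sentences)
    sf = {}                                   # SF: number of sentences containing each word
    for tokens in sentences:
        for word in set(tokens):
            sf[word] = sf.get(word, 0) + 1

    weights = []
    for tokens in sentences:
        if not tokens:
            weights.append(0.0)
            continue
        score = 0.0
        for word in tokens:
            tf = tokens.count(word) / len(tokens)   # term frequency within the sentence
            isf = math.log(n / sf[word])            # inverse sentence frequency
            score += tf * isf
        weights.append(score / len(tokens))         # average over the sentence's words
    return weights
```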

F2: Sentence Length

This feature filters out sentences that are too short or too long, since neither is good for a summary. The computation uses minimum and maximum length thresholds; in our experiment these are 10 and 20 words per sentence. The feature weight is computed as in (2) and is represented graphically in Fig. 3:

SL = 0, if L < MinL or L > MaxL (2)
SL = sin((L - MinL) × (Maxθ - Minθ) / (MaxL - MinL)), otherwise

where
L = length of the sentence
MinL = minimum sentence length
MaxL = maximum sentence length
Minθ = minimum angle (0)
Maxθ = maximum angle (180)

Fig. 3. Sentence Length
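A minimal sketch of Eq. (2), interpreting the angles as degrees so that the sine curve peaks at the mid length (names and defaults are ours, matching the 10- and 20-word thresholds stated above):

```python
import math

def sentence_length_feature(length, min_len=10, max_len=20,
                            min_angle=0.0, max_angle=180.0):
    """Eq. (2): 0 outside [min_len, max_len]; otherwise a sine curve over
    the angle range, peaking at the middle length."""
    if length < min_len or length > max_len:
        return 0.0
    angle = (length - min_len) * (max_angle - min_angle) / (max_len - min_len)
    return math.sin(math.radians(angle))
```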

F3: Sentence Position

The position of a sentence in the text decides its importance: sentences at the beginning define the theme of the document, whereas sentences at the end conclude or summarize it. A threshold value, expressed as a percentage, defines how many sentences at the beginning and at the end are retained in the summary with weight

SP = 1 (3)

For the remaining sentences the weight is computed as follows and is shown graphically in Fig. 4:

SP = cos((CP - MinV) × (Maxθ - Minθ) / (MaxV - MinV))

where
TRSH = threshold value
MinV = NS × TRSH (minimum sentence position)
MaxV = NS × (1 - TRSH) (maximum sentence position)
NS = number of sentences in the document
Minθ = minimum angle (0)
Maxθ = maximum angle (180)
CP = current position of the sentence

Fig. 4. Sentence Position
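A sketch of Eq. (3). The threshold value 0.2 is an assumed example (the paper does not state one), and the absolute value of the cosine is taken here as an assumption to keep the weight within [0, 1]:

```python
import math

def sentence_position_feature(cp, ns, trsh=0.2, min_angle=0.0, max_angle=180.0):
    """Eq. (3): sentences within the first/last `trsh` fraction of the document
    get weight 1; the rest follow the cosine curve over the angle range."""
    min_v = ns * trsh            # boundary after the leading sentences
    max_v = ns * (1 - trsh)      # boundary before the trailing sentences
    if cp <= min_v or cp >= max_v:
        return 1.0
    angle = (cp - min_v) * (max_angle - min_angle) / (max_v - min_v)
    return abs(math.cos(math.radians(angle)))   # abs() is our assumption for [0, 1]
```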

F4: Numerical Data

A sentence that contains numerical data is important and should be included in the summary. The weight for this feature is calculated as

ND(Si) = 1, if a digit exists in Si (4)
ND(Si) = 0, otherwise

where Si is sentence i.
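A one-line sketch of Eq. (4) (names are ours); Python's isdigit() also covers Devanagari digits:

```python
def numerical_data_feature(sentence):
    """Eq. (4): 1 if the sentence contains a digit, 0 otherwise."""
    return 1.0 if any(ch.isdigit() for ch in sentence) else 0.0
```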

F5: Sentence-to-Sentence Similarity

This feature measures the similarity between sentences. For each sentence S, the similarity between S and every other sentence is computed by matching stemmed words, giving a total similarity SS as in (5),

where
N = number of sentences
WT = total words in sentence Si

The individual sentence weight based on similarity is the ratio of SS to N - 1.
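Because the body of Eq. (5) is not reproduced in the text, the sketch below assumes that each pairwise similarity is the count of shared stemmed words divided by WT, and that the feature is SS / (N - 1); names and this interpretation are ours:

```python
def sentence_similarity_feature(sentences):
    """F5 sketch: `sentences` is a list of stemmed token lists."""
    n = len(sentences)
    weights = []
    for i, si in enumerate(sentences):
        wt = len(si)
        if wt == 0 or n < 2:
            weights.append(0.0)
            continue
        ss = 0.0
        for j, sj in enumerate(sentences):
            if i == j:
                continue
            ss += len(set(si) & set(sj)) / wt   # stemmed word matching (assumed form)
        weights.append(ss / (n - 1))
    return weights
```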

F6: Title Feature

The title contains a set of words that represents the gist of the document, so if a sentence Si has a larger intersection with the title words, we can conclude that Si is more important than the other sentences in that document. The title score is calculated as in (6).
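Since Eq. (6) is not reproduced in the text, the sketch below assumes the common form: the overlap between sentence and title words divided by the number of title words (names are ours):

```python
def title_feature(sentence_tokens, title_tokens):
    """F6 sketch: fraction of title words that also occur in the sentence."""
    if not title_tokens:
        return 0.0
    overlap = set(sentence_tokens) & set(title_tokens)
    return len(overlap) / len(set(title_tokens))
```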

F7: SOV Qualification

A sentence is a group of words expressing a complete thought, and it must have a subject and a verb. Word order in Hindi is somewhat flexible; however, the typical order of most sentences is <subject> <object> <verb>, which is why Hindi is sometimes called an "SOV" language. For SOV qualification of a sentence, each word is tagged with a part of speech (noun, adjective, verb, adverb). The input to the tagging algorithm is a set of words, and a tag is assigned to each by looking the token up in a dictionary. The dictionary used in this study is Hindi WordNet 1.2, developed by IIT Mumbai. WordNet is a lexical database in which nouns, verbs, adjectives and adverbs are organized into synonym sets, or synsets, each representing one underlying lexical concept. A synset is a set of synonyms (word forms that relate to the same word meaning); two words are said to be synonyms if their mutual substitution does not alter the truth value of a given sentence in which they occur, in a given context.

Based on the tags assigned, the first noun in the sentence is marked as its subject. The whole sentence is parsed to its end; if a verb is the last word of the sentence, the sentence is qualified as SOV. Only sentences qualified as SOV are used for further processing. The sentence is used after stop word removal.

For example: सरकार तरफ सत्र आखिरी दिन लोकसभा पेश मुलायम विरोध जारी रहेगा।

Word | POS | SOV
सरकार | Noun | Subject
तरफ | Noun | Object
सत्र | Noun | Object
आखिरी | Adjective | Object
दिन | Noun | Object
लोकसभा | Noun | Object
पेश | Adverb | Object
मुलायम | Adjective | Object
विरोध | Noun | Object
जारी | Noun | Object
रहेगा | Verb | Verb

Table 3. POS Tagging

SOV qualification is calculated as

SOV(Si) = 1, if Si is SOV qualified (7)
SOV(Si) = 0, otherwise
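A sketch of the SOV check described above, assuming the sentence arrives as a list of (word, POS) pairs from the Hindi WordNet lookup with tags such as 'Noun' and 'Verb' (the tag strings and function name are ours):

```python
def sov_qualification(tagged_sentence):
    """Eq. (7) sketch: 1 if the sentence has a noun to act as subject and its
    last word is tagged as a verb, else 0."""
    if not tagged_sentence:
        return 0
    has_noun = any(pos == 'Noun' for _, pos in tagged_sentence)
    ends_with_verb = tagged_sentence[-1][1] == 'Verb'
    return 1 if has_noun and ends_with_verb else 0
```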

F8: Subject Similarity

For the subject similarity feature, the result of the previous step is used to match the subject of the sentence with the subject of the title. This is similar to checking the nouns of the title against those of the sentence; the noun plays an important role in understanding the sentence. The feature is given as

Sub(Si) = 1, if the POS is noun and the root values of the title word and the sentence word are equal
Sub(Si) = 0, otherwise
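A sketch of F8 under the assumption that the subject is the first noun found in the tagged sentence and title; `root` stands for a stemming function such as the one mentioned in Section 4.1.4, and its interface here is ours:

```python
def subject_similarity(tagged_sentence, tagged_title, root):
    """F8 sketch: 1 if the first nouns (subjects) of the sentence and of the
    title share the same root form, else 0."""
    def first_noun(tagged):
        for word, pos in tagged:
            if pos == 'Noun':
                return word
        return None

    s_subj = first_noun(tagged_sentence)
    t_subj = first_noun(tagged_title)
    if s_subj is None or t_subj is None:
        return 0
    return 1 if root(s_subj) == root(t_subj) else 0
```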

4.3 Genetic Algorithm

Genetic algorithms are heuristic search algorithms [11] based on the evolutionary ideas of natural selection and natural genetics, as described by David Goldberg. A genetic algorithm is an alternative to traditional optimization techniques that uses directed random search to locate optimal solutions. The GA generates a sequence of populations using a selection mechanism, with crossover and mutation as the search operators. Before applying the GA, the parameters of the problem to be optimized must be encoded, since the GA works with codes that represent the parameters. In this study, chromosomes are encoded as floating-point numbers: each gene represents a specific feature as a float between 0 and 1, and since eight features are used, each chromosome has eight genes. The structure of the genetic algorithm is discussed below.

1. Initial population

This is the set of initial chromosomes, randomly generated within a generation. A chromosome is a collection of genes that together form a candidate solution or individual. Let N be the size of the chromosome population. The population is generated randomly at the beginning, using a random function that produces gene values in the range [0, 1].
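A minimal sketch of this step with the eight-gene, [0, 1]-float encoding described above (names are ours):

```python
import random

def initial_population(pop_size, num_genes=8):
    """Each chromosome holds 8 gene values (one weight per feature) in [0, 1]."""
    return [[random.random() for _ in range(num_genes)]
            for _ in range(pop_size)]
```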

2. Fitness function

The fitness value reflects how good a chromosome is compared to the other chromosomes in the population. A chromosome with higher fitness has a higher chance of survival and reproduction into the next generation. The fitness function is defined to maximize the summation value in (8),

(8)

where fj is the feature value of a sentence.
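Eq. (8) is not reproduced in the text; the sketch below assumes a sentence's score is the gene-weighted sum of its eight feature values and that the chromosome's fitness is the summation of those scores over the document (names and this interpretation are ours):

```python
def sentence_score(chromosome, features):
    """Assumed form of Eq. (8): gene-weighted sum of a sentence's feature values."""
    return sum(w * f for w, f in zip(chromosome, features))

def fitness(chromosome, feature_matrix):
    """Chromosome fitness: summation of sentence scores, the quantity the GA maximizes."""
    return sum(sentence_score(chromosome, feats) for feats in feature_matrix)
```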

3. Selection

This stage of the genetic algorithm chooses the best chromosomes for the crossover and mutation process, to obtain good prospective parents. An individual with a high fitness value is likely to be selected; a chromosome with a small fitness value will be replaced by new, better chromosomes. The roulette wheel selection method [18] is used: the first step is to calculate the cumulative fitness of the whole population as the sum of the fitness of all individuals, after which the probability of selection is calculated for each individual as in Eq. (9).

(9)

Then an array containing the cumulative probabilities of the individuals is built, n random numbers are generated in the range 0 to Pi, and for each random number the first array element with a higher value is searched for. Individuals are thus selected according to their probabilities of selection. The basic advantage of roulette wheel selection is that it discards none of the individuals in the population and gives all of them a chance to be selected.

Fig. 5 illustrates the roulette wheel selection process.

Fig. 5. Roulette Wheel Selection
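A sketch of roulette wheel selection, where each individual is picked with probability proportional to its fitness (the standard reading of Eq. (9); names are ours):

```python
import random

def roulette_wheel_select(population, fitnesses):
    """Select one individual with probability fitness_i / sum(fitness);
    no individual is discarded outright."""
    total = sum(fitnesses)
    if total == 0:
        return random.choice(population)
    pick = random.uniform(0, total)
    cumulative = 0.0
    for individual, fit in zip(population, fitnesses):
        cumulative += fit
        if pick <= cumulative:
            return individual
    return population[-1]
```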

4. Crossover and mutation

The function of crossover is to generate new (child) chromosomes from two parent chromosomes by combining the information extracted from the parents; new chromosomes are generated in each generation. One-point crossover is used, with the crossover point chosen by a random function. The mutation operator applies a probability that an arbitrary gene value will be complemented in a chromosome, so a new child chromosome is generated by complementing a gene value. These processes continue until the fitness values of the individuals in the population converge or a fixed number of generations is reached.
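A sketch of both operators for the float-encoded chromosomes. The mutation rate is an assumed value, and "complementing" a [0, 1] float gene is taken here as 1 - gene (our interpretation):

```python
import random

def one_point_crossover(parent_a, parent_b):
    """One-point crossover: genes before a random cut come from one parent,
    the rest from the other."""
    point = random.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(chromosome, rate=0.1):
    """With probability `rate`, complement a gene (1 - gene for float genes)."""
    return [1.0 - g if random.random() < rate else g for g in chromosome]
```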

5. Sentence Ranking

The document sentences are scored using Eq. (8). Using the GA, the best chromosome is selected after a specific number of generations. The Euclidean distance between each sentence's score vector and the fittest chromosome is then evaluated, and the sentences are sorted in ascending order of distance. Depending on the compression rate, sentences are extracted from the document to generate the summary.
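A sketch of this ranking step; the compression rate of 0.3 is an assumed example, and the selected sentences are returned in document order (names are ours):

```python
import math

def rank_sentences(feature_matrix, best_chromosome, compression_rate=0.3):
    """Rank sentences by Euclidean distance between their feature vectors and
    the fittest chromosome; keep the closest fraction as the summary."""
    distances = []
    for idx, feats in enumerate(feature_matrix):
        dist = math.sqrt(sum((f - w) ** 2
                             for f, w in zip(feats, best_chromosome)))
        distances.append((dist, idx))
    distances.sort()                                       # ascending distance
    keep = max(1, int(len(feature_matrix) * compression_rate))
    return sorted(idx for _, idx in distances[:keep])      # restore document order
```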

5. EVALUATION OF SUMMARY

It is essential to check the usefulness and trustworthiness of the summary. Summarization systems can be evaluated manually by human experts, but this is problematic because individuals have different ideas of what a good summary should contain, and it is a time-consuming process.

Automatically generated summaries can be evaluated using the following parameters [1][3]:

1) Precision: it evaluates the correctness of the sentences in the summary.

P = |Retrieved Sentences ∩ Relevant Sentences| / |Retrieved Sentences|

2) Recall: it evaluates the proportion of relevant sentences included in the summary.

R = |Retrieved Sentences ∩ Relevant Sentences| / |Relevant Sentences|

where the retrieved sentences are those returned by the system and the relevant sentences are those identified by humans.
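A direct sketch of these two measures over sets of sentence indices (names are ours):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall: `retrieved` are the system's sentences,
    `relevant` the human-identified ones."""
    retrieved, relevant = set(retrieved), set(relevant)
    overlap = len(retrieved & relevant)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(relevant) if relevant else 0.0
    return precision, recall
```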

6. CONCLUSION

This paper discusses single-document summarization of Hindi text using the extraction method, with six statistical and two linguistic features. Many techniques have been developed for English text summarization, but far less effort has been devoted to the Hindi language.


