Query Paraphrasing Using Genetic Approach

02 Nov 2017

Mourad Ykhlef

Department of Information System

CCIS, King Saud University

Riyadh, Kingdom of Saudi Arabia

[email protected]

Abeer ALDayel

Department of Information Technology

CCIS, King Saud University

Riyadh, Kingdom of Saudi Arabia

[email protected]

Abstract- The problem addressed in this paper is how to bring users closer to their search needs by enhancing the search query before it is submitted to the search engine. We build a query paraphrasing model that uses a genetic algorithm to find an optimal set of paraphrased queries that best match the user's need, exploring different combinations of paraphrased queries with WordNet as the thesaurus. Experimental results show that the proposed system achieves higher precision than regular query paraphrasing.

Keywords- Query Paraphrasing; Information Retrieval; Optimization; Genetic Algorithm.

Introduction

Information Retrieval researchers have recognized that it is hard for users to formulate effective search queries. Most search engines also return results that have little relation to the submitted query: many of the retrieved documents contain the words listed in the query yet do not satisfy the user's need. An emerging hypothesis holds that "the search is likely to be more accurate and precise if it is based on meanings rather than on words" [1]. Paraphrasing is a way to improve the retrieval process and move querying to a new level, the meaning level.

Paraphrasing for Information Retrieval is known as query reformulation, which can be described as a restatement of a text. It can be accomplished by replacing words with their synonyms or hyponyms and by changing word order. In practice, many search engine users find themselves reformulating the query words until they find the most relevant results for their search. Synonymous paraphrasing can be defined as "changing of natural language text or of its fragments that preserves the meaning of the text as a whole" [1;2].

In a typical search process, users formulate a query that satisfies their information needs over several iterations. The query might be too broad, so that a wide range of documents is returned and the relevant documents are mixed in with documents that have no connection with the searched topic. Alternatively, the query might be too narrow, identifying only a small subset of the relevant documents. In both cases the user will try to improve the search result by reformulating the query. The study in [3] shows that 52% of users modified their queries, with 32% issuing three or more queries within the same search session, which supports the assertion that query paraphrasing is a common part of the search process [4]. The studies in [5;6] show that 80% of web searchers view only the first ten retrieved results. Query paraphrasing is therefore important to bring users closer to their search needs and to enhance the search query so that the most relevant results appear among the first hits.

Current search engines such as Google or Yahoo offer an efficient way to browse web content. However, the quality of the retrieved results relies heavily on users compiling potential paraphrases themselves and searching the documents to identify which of them are actual paraphrases, which is very time consuming and requires a lot of effort. Retrieving relevant web documents is still complicated for the majority of non-expert users, who cannot express their needs with significant words typed in the query.

Users may not always choose the proper words for their query terms. Others may not even notice that the terms they picked will lead to results that are unexpected or irrelevant to their actual request. Consequently, information retrieval models need query paraphrasing in order to improve search performance and satisfy what the user really needs.

Automating the paraphrasing of the user's initial query could be a good mechanism for improving the quality of information retrieval results. Instead of just matching keywords to return documents, a search engine can identify paraphrases of the query. This can also improve the search experience by providing a wider range of relevant search results, a broader set of answers, and possibly less duplicated content in the search results.

To our knowledge, no other study has used a genetic algorithm to enhance a query paraphrasing system. Given this, in the current paper we build an experimental information retrieval system that integrates a query paraphrasing technique with a genetic optimization approach. The genetic algorithm helps find better paraphrases of a query and avoids exhaustive computation, with no user supervision. The system improves information retrieval, achieving an average recall rate of 0.981 and an average precision rate of 0.556.

Previous Works

Three approaches to lexical paraphrasing have dominated the literature. The first approach acquires paraphrases from dictionaries such as WordNet. The authors of [1] used synonymous paraphrasing of the text based on WordNet synonymy data and Internet statistics of stable word combinations (collocations). The authors of [8] used WordNet and part-of-speech information to propose synonyms for the content words in the queries. In [9], three information sources were used to generate lexical paraphrases of Internet queries: WordNet, the Webster online dictionary, and a combination of the Webster-based thesaurus and WordNet. The second approach collects lexical paraphrases from monolingual or bilingual corpora. The research in [10] extracted lexical paraphrases using multiple resources: a monolingual dictionary, a monolingual corpus, and a bilingual corpus. The study in [11] used a parallel corpus to identify paraphrases from multiple English translations of the same source text. The third approach uses query logs to extract paraphrases automatically, acquiring context-specific lexical paraphrases from the web, as in [12] and [13].

The use of optimization algorithms, in particular genetic algorithms, in query reformulation has attracted researchers. Genetic algorithms have been employed in query expansion to avoid the exhaustive calculation needed to find the term most similar to the query concept, and to improve the original query. Research on query expansion using genetic algorithms has followed several avenues. Table 1 shows a taxonomy of the genetic algorithms used in query expansion, based on the chromosome representation, the fitness function and the query expansion technique. The work of Al-Shaor, Hmeidi, & Najadat [14] was a modified approach for Query Expansion using Global analysis and Genetic Algorithm (NAQEGA), with the similarity value between terms as the fitness function and, as chromosomes, vectors corresponding to the terms to be added to the query. This approach improves the average recall rate by 9% compared to traditional concept-based query expansion.

Araujo & Pérez-Agüera [15] used a Spanish morphological thesaurus along with a stemmer, and a genetic algorithm to select the final query terms from the candidate terms. The chromosome is a fixed-length binary string in which each position corresponds to a candidate query term. In the study by Boughanem & Chrisment [16], each chromosome gene corresponds to an indexing term or concept. The initial population (a set of queries) contains the initial query and a list of relevant documents retrieved by this initial query. Thus, at the first iteration of the GA, a relevant document is treated as a query.

From the previous studies we can conclude that no query paraphrasing study has used a genetic algorithm as an enhancement approach. The previous studies that combined query expansion with genetic algorithms focused on single terms in order to enrich the query with more relevant words. In contrast, we use alternative lexical paraphrases of the whole query and an optimization algorithm to rank the paraphrases and avoid exhaustive calculation, as described in the methodology section.

Table 1 Research in the area of query expansion using genetic algorithms

Query expansion technique | Chromosome representation | Fitness function | Authors
Global analysis           | Vectors of terms          | Similarity       | [14]
Morphological thesaurus   | Binary strings            | Similarity       | [15]
Local analysis            | Vectors of terms          | Similarity       | [16]

Methodology

In this research, we perform query paraphrasing by generating lexical paraphrases of queries using the WordNet thesaurus. The query paraphrasing aspect of our work uses the lexical query paraphrasing approach proposed by [8], to which we add a genetic algorithm as an optimization enhancement in order to choose the best paraphrases of the initial query and to improve the quality of query paraphrasing.

In order to evaluate the impact of the optimized paraphrasing approach on document retrieval, we build an experimental system referred to as the GAQP system. Fig. 1 shows the general architecture of the GAQP system and illustrates the connections between all functional components of the system, including post-processing and search.

Figure 1 System overview

The main system's components are:

Tokenizer, tagger and lemmatizer: the user query is tokenized, tagged and lemmatized. Lemmatization is the process of grouping together the different inflected forms of a word so that they can be analyzed as a single item. Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words that have different meanings depending on part of speech [17].

Paraphrase generator: this model consists of three phases. The first phase is synonym generation, which constructs a list of synonyms for each term in the original query using the WordNet thesaurus. Fig. 2 shows an example for the original query "causes of scurvy disease", which has three synonym lists, one for each term. The second phase is encoding, in which the synonym lists are encoded as lists of integers that represent the index of each synonym in its list. In Fig. 2 there are three synonym lists; encoded list 1 consists of 1000 * the index of each synonym in synonym list 1, and so on. We chose a multiplier of 1000 because it allows up to 999 synonyms for each term in the original query. To obtain better paraphrased queries, each term of the original query is included in its corresponding synonym list. The third phase is the genetic algorithm phase (Fig. 3). This phase proposes paraphrases of the original query by trying different combinations of replacement synonyms drawn from the encoded synonym lists, in order to find the best paraphrases of the original query. It works as follows:

Chromosome representation: we use a fixed-length chromosome that corresponds to one paraphrased query; its length is equal to the number of terms in the original query. Fig. 3 shows an example of a population that contains different combinations of the encoded synonym lists for the original query (Causes of scurvy disease).

Selection: this process evaluates the fitness value of the chromosomes and carries the best ones into the next generation.

Fitness function: we use the paraphrase score proposed by [8] as the fitness function. This function is based on how common the lemma combinations in the paraphrased query are. The approximation of the score of a paraphrase composed of n lemmas is as follows:

where freq(ti, tj) is the frequency, in the document corpus, of two consecutive terms of the paraphrased query.

Genetic operation (crossover/mutation): two chromosomes are randomly selected from the population and "mated" by randomly picking a gene and then swapping that gene and all subsequent genes between the two chromosomes. The two modified chromosomes are then added to the list of candidate chromosomes.

Indexing and searching: searching is the process of looking up words in an index to find the documents in which they appear. We used Lucene, a Java library, as the search kernel to retrieve the documents relevant to the provided queries.
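The encoding phase of the paraphrase generator can be illustrated with a minimal Python sketch. It rests on one reading of the description, namely that a gene equals 1000 * the synonym list number + the 1-based index of the synonym in that list; this rule, the function names and the toy synonym lists (taken from the paraphrases in Table 2 rather than from WordNet) are assumptions made for illustration and may differ from the GAQP implementation.

```python
BASE = 1000  # multiplier from the text: room for indices 1..999 per term

# Toy synonym lists for "causes of scurvy disease" (the stop word "of" is
# not a term). The original term comes first in each list.
SYNONYM_LISTS = [
    ["causes", "reason"],
    ["scurvy", "scorbutus"],
    ["disease", "illness", "sickness", "malady", "unwellness"],
]


def encode_lists(synonym_lists):
    """Turn each synonym list into a list of integer genes (assumed rule)."""
    encoded = []
    for list_number, synonyms in enumerate(synonym_lists, start=1):
        if len(synonyms) >= BASE:
            raise ValueError("at most 999 synonyms per term")
        encoded.append(
            [BASE * list_number + i for i in range(1, len(synonyms) + 1)]
        )
    return encoded


def decode_gene(gene):
    """Split a gene into (synonym list number, 1-based synonym index)."""
    return divmod(gene, BASE)


def chromosome_to_terms(chromosome, synonym_lists):
    """Map a chromosome (one gene per query term) back to its words."""
    terms = []
    for gene in chromosome:
        list_number, index = decode_gene(gene)
        terms.append(synonym_lists[list_number - 1][index - 1])
    return terms


encoded = encode_lists(SYNONYM_LISTS)
paraphrase = chromosome_to_terms([1002, 2001, 3003], SYNONYM_LISTS)
```

Under this assumed rule a chromosome such as [1002, 2001, 3003] decodes to the terms of "reason of scurvy sickness", and every gene identifies both the query term it replaces and the chosen synonym.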

Figure 2 Paraphrase generator phase 1 and 2

Figure 3 Phase 3 of paraphrase generator model
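To make the role of the fitness function more concrete, the following Python sketch scores a paraphrase from the frequencies of its consecutive lemma pairs. It assumes that the score is the product of freq(ti, tj) over consecutive pairs, which is one possible reading of the description and is not confirmed by the text; the function name and the bigram counts are hypothetical stand-ins for real corpus statistics.

```python
from math import prod

# Hypothetical corpus counts of consecutive lemma pairs (illustration only).
BIGRAM_FREQ = {
    ("cause", "scurvy"): 12,
    ("scurvy", "disease"): 30,
    ("scurvy", "illness"): 4,
}


def paraphrase_score(lemmas, bigram_freq):
    """Assumed score: product of freq(t_i, t_{i+1}) over consecutive lemmas.

    A pair that never occurs in the corpus counts as 0, so the whole
    paraphrase scores 0. A single-lemma query has no pairs and scores 1.
    """
    pairs = zip(lemmas, lemmas[1:])
    return prod(bigram_freq.get(pair, 0) for pair in pairs)


score_disease = paraphrase_score(["cause", "scurvy", "disease"], BIGRAM_FREQ)
score_illness = paraphrase_score(["cause", "scurvy", "illness"], BIGRAM_FREQ)
```

With these toy counts the paraphrase ending in "disease" scores higher than the one ending in "illness", so selection would favor it. A sum or a normalized probability could be substituted for the product without changing the rest of the genetic algorithm.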

Experiment

In order to verify the effectiveness of the proposed system, the experiment was carried out on two collections, USENET and Wikipedia dump files, using an Intel Core i7 PC with a 2.80 GHz CPU and 8 GB of RAM. The USENET collection contains text files of public USENET postings collected between Oct 2005 and Jan 2011 [18]. The Wikipedia dump files are a collection of Wikipedia article pages. We used WP2TXT [19] to extract plain text from the Wikipedia dump files, which are encoded in XML, and to strip all MediaWiki markup and other metadata. The total collection size is 11.5 GB, containing 762 text files. The genetic algorithm was configured with a population size of 100 and a total of 50 evolutions. As the termination criterion, the program ends when 50 evolutions are reached.
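The genetic algorithm settings above (population size 100, 50 evolutions) and the crossover described in the methodology can be put together in a short Python sketch. This is a simplified illustration under stated assumptions, not the GAQP implementation: the function names are hypothetical, selection simply keeps the fittest candidates, mutation is omitted, and the toy fitness function stands in for the paraphrase score of [8].

```python
import random


def single_point_crossover(a, b, rng):
    """Pick a random gene and swap it and all subsequent genes."""
    point = rng.randrange(len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def evolve(gene_pools, fitness, population_size=100, generations=50, seed=0):
    """Evolve chromosomes that hold one gene per query term.

    gene_pools[k] lists the encoded synonyms allowed at position k.
    Returns the final population sorted from fittest to least fit.
    """
    rng = random.Random(seed)
    population = [
        [rng.choice(pool) for pool in gene_pools]
        for _ in range(population_size)
    ]
    for _ in range(generations):
        candidates = list(population)
        for _ in range(population_size // 2):
            a, b = rng.sample(population, 2)
            candidates.extend(single_point_crossover(a, b, rng))
        # Selection: keep only the fittest candidates for the next generation.
        candidates.sort(key=fitness, reverse=True)
        population = candidates[:population_size]
    return population


def toy_fitness(chromosome):
    """Hypothetical fitness: prefers larger synonym indices."""
    return sum(gene % 1000 for gene in chromosome)


POOLS = [[1001, 1002], [2001, 2002], [3001, 3002, 3003, 3004, 3005]]
final_population = evolve(POOLS, toy_fitness)
best = final_population[0]
```

Because a gene always stays at its own position during crossover, every offspring remains a valid combination of one synonym per query term, so no repair step is needed after mating.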

Our experiments aimed to test how much the genetic algorithm can improve the results of the query paraphrasing system, assuming that a good fitness function can be found. The quality of the system was measured using the precision and recall metrics. Recall measures how well the search system finds relevant documents, whereas precision measures how well the system filters out irrelevant documents. We used 11 queries to carry out our experiment; Table 2 shows a sample of the paraphrased queries produced by the GAQP system for two submitted queries. Human judgment was used for each query to decide which of the retrieved documents are relevant. Fig. 4 compares the precision rate of the GAQP system over the eleven queries with that of the regular query paraphrasing system proposed by [8]. The average precision rates of the GAQP system and the regular query paraphrasing system are 0.556 and 0.46 respectively. The average recall rate is 0.981 for the GAQP system and 0.964 for the regular query paraphrasing system, as shown in Fig. 5. We can see that the deviation in the recall values produced by the GAQP system was relatively higher than that observed for the regular paraphrasing system. The experimental results show that the precision of the GAQP system is better by approximately 9% than that of the regular query paraphrasing system. This improvement is due to the use of the genetic algorithm as an enhancement approach. The regular query paraphrasing system tends to scan all possible query paraphrases. In contrast, the GAQP system treats paraphrased queries as chromosomes and is guided by the fitness function toward better paraphrased queries, which avoids the exhaustive computation.

Table 2 Sample of the paraphrased queries

Original Query | Causes of scurvy disease    | Kind of flora
Paraphrase 1   | Causes of scurvy illness    | Form of plant
Paraphrase 2   | reason of scurvy sickness   | Kind of plant
Paraphrase 3   | Causes of scorbutus malady  | Form of vegetation
Paraphrase 4   | Causes of scurvy unwellness | Sort of plant
Paraphrase 5   | Causes of scurvy disease    | Sort of vegetation

Figure 4 Precision curve to eleven queries

Figure 5 Recall curve to eleven queries
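The precision and recall metrics used in the experiment follow their standard definitions, which the Python sketch below computes for a single query. The function name and the document identifiers are hypothetical; in the experiment, the set of relevant documents comes from human judgment.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved documents / all retrieved documents
    recall    = relevant retrieved documents / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 judged relevant overall.
precision, recall = precision_recall(
    ["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"]
)
```

In this example two of the four retrieved documents are relevant, so precision is 0.5, and two of the three relevant documents were retrieved, so recall is 2/3. Averaging these per-query values over all queries gives average rates of the kind reported above.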

Conclusion

We have presented an automatic mechanism for generating lexical paraphrases of queries using a genetic algorithm, and have compared the performance of this mechanism with that of a regular paraphrasing IR system. Our results show that query paraphrasing using a genetic algorithm improves the average recall rate and the average precision rate. As future work, we plan to improve the performance of the query paraphrasing system by combining more than one optimization algorithm. We would also like to extend our work to document corpora with different characteristics, such as corpora in the Arabic language.

Acknowledgments

This work was supported by the Research Center of College of Computer and Information Sciences, King Saud University. The authors are grateful for this support.


