The Multiple Search And Similarity Result Computer Science Essay


02 Nov 2017


Abstract

The increasing dependency on web information demands improvements in web search for accurate and highly similar results. Most search engines retrieve results from the web-crawled index database they maintain, which limits the scope of a search. They also apply the same search technique to both short and long queries. To deliver consistently superior results, one must understand the exact intent of the query and the strength of each keyword in it. This paper proposes a multiple-search and similarity-based result-merging framework for improving web search, built on identifying the strength of the keywords in a user query. Based on the computed query strength, multiple searches are issued to different search engines to obtain multiple result sets. To merge the obtained results, a similarity-conversion-based method is proposed that delivers highly accurate and similar results.

Keywords: Web Search, Search engine, Similarity measures, Query processing, Multiple Searching

INTRODUCTION

Searching is the second most popular activity on the web after email. Search engines are the prominent tools used for searching, yet no single search engine is capable of covering the huge resources of the web [1][4]. Searching is becoming complex due to the variation in user query length: a query can be a single word, a phrase, or a question. In most cases, the correctness of the result depends on how efficiently the user query is processed and related to the retrieved information [3]. A study evaluating the overlap among the results of six popular search engines, namely Google, Yahoo, Bing, Ask, AltaVista, and Excite, reveals that 85% of results repeat, 80% of results are common to all search engines, and 3% of results are unique. The high percentages of repeating and unique results imply that the results of a single search engine may not be sufficient to provide the relevant and required results, and at the same time that efficiently processing the user query to find its strength is important for accurate web search. Even when a user performs the same search on different search engines, the repetition of the same results leads to further dissatisfaction.

User search queries are often based on an approximation and synopsis of an information need. Exactly matching the terms in the search query is a woefully inadequate method for finding the correct, or even correlated, information [19]. In contrast, leveraging the associations between these terms can reveal what the user really wants to find. The terms may be vague, or not even correct for the specific type of information, especially since the user may only have a rough approximation of the target information or be unfamiliar with the mechanics of search engines. It is therefore very important to analyze user queries before submitting them to a search engine for result retrieval. The proposed approach uses multiple search engines based on the strength of the user query; the results returned by the multiple search engines are then merged effectively to obtain accurate results.

RELATED WORKS

Many research studies have examined search queries on web search engines. One of the oldest works [9] analyzed a large AltaVista query log, performing correlation analysis of the log entries and studying the interaction of terms within queries. Beeferman et al. [12] describe how the queries issued and the URLs clicked in the results can be viewed as a bipartite graph, and how agglomerative clustering applied to this graph can discover clusters of similar queries and similar URLs. Baeza-Yates et al. [5] describe a way to represent queries in a vector space based on a graph derived from the query-click bipartite graph, and show how the representation can be used to infer semantic relationships.

Measuring the semantic similarity [18][20] between two texts has been studied extensively in the information retrieval and natural language processing communities, through techniques such as stemming [8][13], translation models [11], and query expansion [6][14]. However, no research has assessed the strength of a query posed by a user for information retrieval.

Broder [15] analyzed the user needs behind queries to a search engine. The study reveals that, based on need, user queries can be classified into three groups: navigational, informational, and transactional. The analysis states that 48% of queries are informational, 30% are transactional, and 20% are navigational. It is therefore very important to understand the need behind a user's query. Our proposal computes the weights of the query keywords with respect to the user's query-log history, and based on the overall keyword weights it decides the number of multiple searches to make.

FRAMEWORK FOR MULTIPLE SEARCHING

The proposed framework for multiple searching is shown in Figure 1. Its major components are query processing, keyword weighting, multiple searching, and result ranking and merging. The components work together to improve web search.

[Figure 1 (diagram): a search query enters Query Processing, which exchanges query updates with the Query Logs store; the processed keywords pass to Keywords Weight and then to Multiple Searching, which dispatches the query to Search Engines 1 through n; the returned search results are gathered and passed to Result Merging, which produces the final search results.]

Figure 1 – Framework for Multiple Search

Query Processing

As described above, a user generally submits a query based on a need or an approximation, and it typically contains determiners, prepositions, adjectives, and nouns. It is necessary to filter the keywords out of the input query. Query processing is a text-processing block that extracts keywords and filters determiners and prepositions from the query phrase submitted for search. For example, if a user submits the query "recent improvement in medical treatment for cancer", the keywords extracted are "recent", "improvement", "medical", "treatment", and "cancer", while "in" and "for" are filtered out by the query processor. The query processor maintains a filter library for removing determiners and prepositions from the query. It also records the user query in the query-log database.
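The filtering step above can be sketched in a few lines. This is a minimal illustration, not the paper's actual filter library: the stop-word list here is a small hypothetical sample of determiners and prepositions.

```python
# Sketch of the query-processing step: extract keywords by filtering
# determiners and prepositions. The filter list is an illustrative
# sample, not the framework's actual filter library.
STOP_WORDS = {"a", "an", "the", "in", "for", "of", "on", "at", "to", "by"}

def process_query(query):
    """Return the keywords of a query, dropping filtered words."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]

keywords = process_query("recent improvement in medical treatment for cancer")
# keywords == ["recent", "improvement", "medical", "treatment", "cancer"]
```

A real filter library would also handle punctuation and a fuller word list; the shape of the operation is the same.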

Keywords Weight

The weights of the keywords in a query implicitly indicate the hardness of the query and the intention behind the user's information need. Here, hardness is used to decide the number of searches to make for efficient information retrieval. The keyword-weight block of the framework computes each keyword's weight using the user's historical query logs.

Suppose a query Q is represented by a vector of keywords K = (k1, …, kn), and L = (l1, …, ln) is a vector of query logs for a user. If a keyword k does not appear in L, its weight is zero. If a keyword is present in L, its weight is computed from two factors: keyword frequency (kf) and log frequency (lf). The kf of a keyword is the number of times it appears in the query logs, and lf is the number of log records related to the keywords K. The keyword weight (kw) is then computed as

(1)

Using the value of kw, we compute the hardness percentage of a query, Qhard, as

(2)

The Qhard percentage decides the number of searches to be made for information retrieval. Silverstein et al. [9] revealed that most users submit simple and short web search queries, which receive the most common results from a single search engine. The proposed approach addresses this drawback by computing the keyword weights and query hardness for short or long queries using the user's past query-log history, and performing multiple searches accordingly to meet the user's need and improve the search.
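Since the bodies of equations (1) and (2) are not reproduced in the text, the sketch below uses stand-in formulas built only from the quantities defined above: the weight of a present keyword is taken as the fraction of log records containing it (so a keyword absent from L gets weight zero, as stated), and Qhard as the mean keyword weight scaled to a percentage. These are assumptions, not the paper's exact formulas.

```python
# Hedged sketch of keyword weighting and query hardness. The exact
# forms of equations (1) and (2) are omitted from the text, so kw and
# Qhard below are plausible stand-ins, not the authors' formulas.
def keyword_weight(keyword, query_logs):
    # lf: number of log records containing the keyword; weight is zero
    # when the keyword never appears in the logs, as the paper states.
    lf = sum(1 for log in query_logs if keyword in log.split())
    return lf / len(query_logs) if query_logs else 0.0

def query_hardness(keywords, query_logs):
    # Mean keyword weight, expressed as a percentage in [0, 100].
    weights = [keyword_weight(k, query_logs) for k in keywords]
    return 100.0 * sum(weights) / len(weights) if weights else 0.0
```

With this form, repeatedly submitting the same query grows the log frequency of its keywords and therefore Qhard, matching the behavior described in the experiment section.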

Multiple Searching

Multiple searching is a novel concept proposed in this paper based on the hardness of the query. Current meta search engines are built on multiple underlying search engines, but most of them focus on effective techniques for merging search-engine results rather than on the user's information need, which sometimes yields highly irrelevant results. As described in Section 1, 85% of results repeat for the same query across different search engines; a novel approach is therefore proposed to overcome blind search over multiple search engines, taking a decision-based approach driven by query hardness. To take this decision we compute the query hardness using equations (1) and (2).

To support multiple searching with M searches, assume n search engines S = (S1, …, Sn) are integrated in the framework, each ranked by its popularity and coverage scope. For effective selection of the number of search engines to use, we define four ranges of query hardness, as shown in Table 1.

Table-1
Query Hardness Range and Multiple Searching

Hardness Range (Qhard)    Multiple Searching (M)
0 – 25%                   M = n/4
>25 – 50%                 M = n/3
>50 – 75%                 M = n/2
>75 – 100%                M = n

Based on the value of M, that number of searches is made, and the search engines are selected by their rank scores. If M = 1, the highest-ranked search engine is used; if M = 2, the top two ranked search engines are used, and so on. Thus, based on the hardness range and multiple searching, effective results are retrieved. Since multiple search engines are used, multiple result sets are obtained; these undergo a result-merging process before the results are presented.
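The decision rule of Table 1 and the rank-based engine selection can be sketched as below. The integer division and the floor of at least one engine are assumptions for the boundary cases the paper does not spell out; the rank scores match those assigned in the experiment section.

```python
# Sketch of Table-1's decision rule: map the hardness percentage to a
# number of engines M, then choose the top-M engines by rank score.
def num_searches(q_hard, n):
    if q_hard <= 25:
        m = n // 4
    elif q_hard <= 50:
        m = n // 3
    elif q_hard <= 75:
        m = n // 2
    else:
        m = n
    return max(1, m)  # assumption: always query at least one engine

def select_engines(rank_scores, q_hard):
    """rank_scores: dict of engine name -> popularity rank score."""
    m = num_searches(q_hard, len(rank_scores))
    return sorted(rank_scores, key=rank_scores.get, reverse=True)[:m]

# Rank scores as assigned in the experiment section
engines = {"Yahoo": 1.0, "Bing": 0.8, "Ask": 0.6, "AltaVista": 0.4, "Excite": 0.2}
```

For example, with the five engines above a hardness of 60% gives M = 5//2 = 2, so Yahoo and Bing would be queried.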

Search Results and Result Merging

The search-results block collects the multiple result sets returned by the search engines, and the collected results are processed by the result-merging block of the framework. Several result-merging algorithms have been proposed [10][17], but merging based on the local rank of results is simple and yields good performance. Similarity-conversion methods for converting ranks to similarities were proposed by Lee [16]. We propose a modified similarity-conversion method for effective ranking and merging of the results to improve the search results.

Assume a query Q is submitted to the n search engines participating in multiple searching. The top 10 results of each search engine are collected as Rn = (r1, …, r10). A unique set of results is assembled by checking against the buffered search results; this verification eliminates duplicates, and the search continues until the top 10 non-duplicate results are collected. To merge the collected results effectively we apply two algorithms: first, we calculate each result's similarity rank; second, we compute each result's relevancy similarity to the query to determine its final rank.

First algorithm: To rank the unique results, a modified similarity-conversion value sim_rank(r) is derived for each result from its local rank and the search engine's rank score.

(3)

where r is the result, rrank is the local rank of r, and Sscore is the search engine's rank score. The rank score is a constant assigned according to a popularity survey, ranging from 1 for a highly popular search engine down to 0 for an unpopular one.
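The body of equation (3) is not reproduced in the text. The sketch below follows Lee's [16] rank-to-similarity conversion over the top-k results, scaled by the engine's rank score; this is one plausible reading of the "modified" method, not the authors' exact formula.

```python
# Hedged sketch of equation (3): Lee-style rank-to-similarity
# conversion over the top-k results, scaled by the engine's rank
# score S_score. The exact modified form is an assumption.
def sim_rank(local_rank, s_score, top_k=10):
    return s_score * (1.0 - (local_rank - 1) / top_k)
```

Under this form, the first result of the top engine (Sscore = 1) gets similarity 1.0, while lower local ranks and less popular engines both reduce the score.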

Second algorithm: The ordered results may be correct in terms of rank conversion, but their relevancy to the query may differ. We calculate relevancy from the results' content. Search-engine results contain rich information about the retrieved documents [2]; in particular, the title and snippet of a result reflect its relevancy to the corresponding document for the query. This approach utilizes the titles and snippets of the results to compute the relevancy similarity.

Assume DT and DS are the numbers of distinct query terms in the title T and snippet S for a query q, and DQ is the set of distinct query terms. QDlen is the total number of distinct terms in the query. NT and NS are the total numbers of query terms in the title and snippet, and Tlen and Slen are the total numbers of terms in the title and snippet. QTOrder captures the order of the query terms in the title or snippet: if the terms appear in the same order as in the query, QTOrder = 1; otherwise it is 0. The relevancy similarity of the title and snippet is then computed as follows:

To find the final similarity between the result and the query, the values are merged as follows, where R is the result and QDlen is the total number of distinct query terms.

(7)

The computed final similarity value of a result is used for the final ranking: the higher the value, the higher the rank of the result.
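The bodies of equations (4) through (7) are not reproduced in the text, so the sketch below is an illustrative relevancy score built only from the quantities defined above: distinct-term coverage (DT/QDlen), query-term density (NT/Tlen), and the order flag QTOrder. The way they are combined here is an assumption, not the paper's formula.

```python
# Hedged sketch of the relevancy similarity (equations 4-7 omitted
# in the text). Applies to a title or a snippet alike; the
# combination of the defined quantities is illustrative only.
def relevancy(query, text):
    q_terms = query.lower().split()
    t_terms = text.lower().split()
    q_distinct = set(q_terms)
    d = len(q_distinct & set(t_terms))               # DT (or DS)
    n = sum(1 for t in t_terms if t in q_distinct)   # NT (or NS)
    # QTOrder: 1 if the query terms appear in the same order
    order = 1 if " ".join(q_terms) in " ".join(t_terms) else 0
    if not t_terms:
        return 0.0
    return (d / len(q_distinct)) * (n / len(t_terms)) + order
```

A final score for a result would combine the title and snippet values, e.g. with the sim_rank of the first algorithm, before ranking.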

EXPERIMENT AND RESULTS

For experimental evaluation we implemented the designed framework using Java Enterprise technology, the Tomcat web server, and a DAT file for storing user query logs, with a three-column data structure of user id, query, and timestamp. Five search engines, Yahoo, Bing, Ask, AltaVista, and Excite, were integrated into the framework for evaluation. To test the framework we submitted multiple repetitive queries to build a query-log database of 3000 records. We assigned rank scores to the search engines as Yahoo = 1, Bing = 0.8, Ask = 0.6, AltaVista = 0.4, and Excite = 0.2.

To measure the effectiveness of the web-search improvement, we assess the relevancy of the results to the submitted query and their ranks, measuring recall and precision with and without multiple-search integration. Recall is the proportion of the relevant results that are retrieved for a query, and precision is the proportion of the retrieved results that are relevant, as described below.

Recall = (relevant results retrieved) / (total relevant results) (8)

Precision = (relevant results retrieved) / (total retrieved results) (9)
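The two measures can be computed directly as set overlaps, as in this short sketch (result identifiers here are hypothetical):

```python
# Recall (8) and precision (9) as defined above: the fraction of
# relevant results retrieved, and the fraction of retrieved results
# that are relevant.
def recall(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

def precision(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(set(retrieved))
```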

To test the proposed framework, we first submitted a query without the multiple-search approach, selecting the single search engine with the highest rank score to retrieve results, and then repeated the same query with the multiple-search approach. This test was repeated several times with various ambiguous queries to evaluate the effectiveness and improvement.

The experiment shows that all search engines repeat the same results when posed the same or a slightly modified query, whereas the proposed framework improves search on repetitive or modified queries: the hardness of a query increases with repeated searches, triggering multiple searches that retrieve more relevant results while filtering out duplicates.

To measure the improvement we computed recall and precision for different queries in each run, observing three factors: retrieved results, results relevant to the query, and relevant search results. Since we present the top 10 merged results to the user, the retrieved-results factor is always 10. The other two factors were observed in each run; the analysis is shown in Figure 2 and Figure 3.

Figures 2 and 3 compare precision and recall with and without the multiple-search approach. The results show that with multiple searches both ratios are higher. The higher precision is due to more relevant results among those retrieved, and the higher recall is due to more of the results relevant to the query being retrieved.

Figure 2 – Precision Comparison of With and Without Multiple Search

Figure 3 – Recall Comparison of With and Without Multiple Search

The low ratios without multiple searches are due to the same results repeating when the same query is submitted multiple times, or even for a different or modified query. Retrieving the same results reduces both the relevant results and the results relevant to the query, which directly lowers the precision and recall ratios.

Thus, higher precision and recall on the retrieved results improve the satisfaction of user needs. The retrieved results may not initially be relevant to the query or the user's need, but repeatedly submitting the same or a similar query increases its hardness, leading to multiple searches that present more relevant and non-duplicate results.

CONCLUSION

In this paper we propose a novel framework to improve web search. Most search engines retrieve the same results for the same or similar queries, which dissatisfies users and degrades web search. We address this drawback by proposing a novel multiple-search and result-merging approach. The approach measures the hardness of a query with the support of the user's past query logs; hardness is computed from the weights of the keywords present in the query. Based on the hardness, a decision is made on the number of multiple searches to perform. The multiple search results are processed to retrieve unique results, which are further ranked and merged using similarity-conversion-based methods. The experimental results support our proposal, showing improved precision and recall ratios with multiple searches compared to without. In future work we plan to extend the approach to retrieve results more relevant to both the query and the user by constructing a user-behavior model.

REFERENCES

H. Chen, M. Lin, Y. Wei, Novel association measures using web search with double checking, in: Proc. of the COLING/ACL 2006, 2006.

M. Sahami, T. Heilman, A web-based kernel function for measuring the similarity of short text snippets, in: Proc. of 15th International World Wide Web Conference, 2006.

Jarvelin, K. and Kekalainen, J. "IR Evaluation Methods for Retrieving Highly Relevant Documents", In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, pages 41-48, 2000.

K. Risvik, Y. Aasheim, and M. Lidal. Multi-tier architecture for web search engines. In First Latin American Web Congress, pages 132–143, 2003.

Baeza Yates R. and Tiberi A. Extracting semantic relations from query logs. Proceedings of the 13th ACM SIGKDD conference, San Jose, California, USA, Pages 76-85, 2007.

Lavrenko, V. and Croft, W.B. Relevance based language models. In Proceedings of SIGIR ‘01, pages 120-127, 2001.

Spink, A.; Jansen, B.J.; Blakely, C.; Koshman, S.; "Overlap Among Major Web Search engines", ITNG 2006 Third International Conference on Information Technology: New Generations, 2006. Page(s):370 – 374, 10-12 April 2006.

K. Church, P. Hanks, Word association norms, mutual information and lexicography, Computational Linguistics 16 (1991) 22–29.

Silverstein, C., Henzinger, M., Marais, H. and Moricz, M. Analysis of a very large AltaVista query log. SRC Technical Note 1998-014, October 26, 1998.

M. Shokouhi and J. Zobel. Robust result merging using sample-based score estimates. ACM Transactions on Information Systems, 27(3):1-29, 2009.

Berger, A. and Lafferty, J. Information retrieval as statistical translation. In Proceedings of SIGIR ’99, pages 222-229, 1999.

Beeferman D. and Berger A. Agglomerative clustering of a search engine query log. Proceedings of the sixth ACM SIGKDD international conference, Boston, Massachusetts, United States, Pages 407-416, 2000.

Krovetz, R. Viewing morphology as an inference process. In Proceedings of SIGIR ’93, pages 191-202, 1993

Zhai, C. and Lafferty, J. Model-based feedback in the language modeling approach to information retrieval. In Proceedings of CIKM ‘01, pages 403-410, 2001.

Broder, A. A taxonomy of web search. ACM SIGIR Forum 36(2):3-10, 2002.

Lee, J. Analyses of multiple evidence combination. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267-276, 1997.

Craswell, N., Hawking, D., and Thistlewaite, P. Merging results from isolated search engines. In Proceedings of the 10th Australasian Database Conference, pages 189-200, 1999.

P. Resnik, Using information content to evaluate semantic similarity in taxonomy, in: Proc. of 14th International Joint Conference on Artificial Intelligence, 1995.

B. Vélez, R. Weiss, M. Sheldon, D. Gifford, Fast and effective query refinement, in: Proc. of 20th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, 1997.

D. Lin, An information-theoretic definition of similarity, in: Proc. of the 15th ICML, 1998.


