Ranking Method Which Employs Semantic Similarity Computer Science Essay

02 Nov 2017


Information Retrieval (IR) is a domain concerned with the structure, analysis, organization, storage, search and discovery of information. The challenge of IR is to find, among the large number of available documents, those that best fit the user's needs. Evaluating an IR system means measuring its performance with respect to those needs; for this purpose, the evaluation methods widely adopted in IR are based on models that provide a basis for comparative evaluation of the effectiveness of different systems by means of common resources. In this context, several questions arise regarding the improvement of the information retrieval process and the manner in which returned results are evaluated.

Research on search engines has a long history in philosophy, psychology, and artificial intelligence, and various perspectives have been suggested from both academia and industry. Users can obtain many kinds of information associated with their keywords from current commercial search engines such as Google. However, the results of these search engines lack semantic relationships with each other, since they may be ordered by importance only. Furthermore, these results contain much redundant information, and it is time-consuming for users to find the most relevant information they want. Hence, this paper introduces a ranking method based on the calculation of the semantic similarity of web search results to improve the user experience of search engines, especially for finding the most relevant information.

Whenever a re-ranking algorithm is performed, two important things should be taken into account: the first is the original query and the second is the initial retrieval scores. A web search engine is designed to search for information on the World Wide Web and FTP servers. Google Search is the most-used search engine on the World Wide Web, receiving several hundred million queries each day through its various services. Snippets, brief windows of text extracted by a search engine around the query term in a document, provide useful information regarding the local context of the query term. [3]

Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable data space, whether on the Web or within a closed system, to generate more relevant results. Hence, this paper proposes an automatic ranking system based on the calculation of the semantic similarity of these results to improve the user experience. To calculate the semantic similarity, which is based on WordNet, a snippet is fetched from each of the top N results of the search engine. A balanced similarity ranking method, combined with Google's rank and the timeliness of the pages, is proposed to rank these snippets, and re-ranking is then done based on the similarity.

LITERATURE SURVEY

Related Work

Keyword matching techniques are used by current search engines and have the following weaknesses. First, web users often cannot express their search intention accurately using a few keywords, and hence the exactly-matched results do not necessarily satisfy them. Second, keyword matching cannot guarantee that the selected candidates have a high correlation with the user query, given the different positions and meanings of the keywords. Third, under keyword matching, the top-ranked search results for a given query must contain as many of the keywords as possible; otherwise they lose their ranking positions even though their contents discuss exactly the same thing. This also leads to an awkward situation: spammers try their best to pollute the web document corpus with term-spamming tricks such as repetition, dumping and weaving.

Another problem with current search engines is their ranking schemes. PageRank is the most popular ranking algorithm; however, it is based on the popularity of web documents, not their quality. Therefore, a newborn web document usually cannot reach highly-ranked positions, due to its freshness and thus little reputation. Search based on lexical semantics, in which content is checked semantically instead of by keyword matching, can better adapt to the thinking patterns of human beings, and thus search results are more relevant to users' search intention. Meanwhile, using semantic factors can reconcile freshness and give highly relevant new pages a moderate rank promotion. In our work, we fetch the top N results returned by search engines such as Google for user queries, and use semantic similarities between each candidate and the query to re-rank the results. We first convert the ranking position into an importance score for each candidate. We then combine the semantic similarity score with this initial importance score and finally obtain the new ranks. [1]

In this paper we are interested specifically in the semantic evaluation of the results returned by search engines. This choice is motivated by their popularity in the Web community on the one hand and the degree of selectivity that they offer on the other. More precisely, this system makes it possible to: [2]

Retrieve the results returned by search engines

Check the information content of each returned page.

Project the user query on the linguistic resource, the WordNet ontology.

Measure the results relevance by calculating the relevance degree of each of them.

Generate a semantic rank of results according to the calculated relevance, based on their degree of informativeness.

Assign a score to each web page based on its position in the new ranking.

This system is based partly on a linguistic resource (the WordNet ontology) for the semantic projection of the query and partly on a calculation model for measuring 'document/query' relevance. In the following, we justify our choices in terms of the chosen linguistic resource and the IR model used.

In the literature, graph-based ranking algorithms [6] were developed that can be very helpful when searching for important pages on the World Wide Web. Such algorithms capitalize on the existence of explicit links (e.g., hyperlinks, citations) between the graph vertices. In the case of flat text collections, neither links nor citations exist, so the need arises to devise implicit edges between text keywords or sentences. One solution is to exploit the contextual information of terms and create semantic graphs from text based on content similarity.
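As a concrete illustration, a minimal sketch of building such implicit edges might look as follows. The bag-of-words cosine used as the content-similarity measure and the threshold value are illustrative assumptions, one choice among many:

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_graph(sentences, threshold=0.2):
    """Create implicit edges between sentences whose content
    similarity exceeds a threshold (adjacency-list form)."""
    bags = [Counter(s.lower().split()) for s in sentences]
    graph = {i: [] for i in range(len(sentences))}
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if cosine_sim(bags[i], bags[j]) >= threshold:
                graph[i].append(j)
                graph[j].append(i)
    return graph
```

Graph-based ranking algorithms can then run over these implicit edges just as they would over hyperlinks.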

Use of Ontology

The need to use an ontology for information retrieval (IR) has been explored by several approaches in order to better answer users' queries. In the information filtering process, this aspect is the subject of the contribution we present in this paper. The idea is to use an ontology to assess the semantic similarity between the user intent and the returned web results. This can be done by extracting the query terms and their semantic projection, using the WordNet ontology, onto the set of returned documents. The result of this projection is used to extract concepts related to each term, thus building a semantic relation that is the basis of the web page ranking. In this paper the WordNet ontology is used.

WordNet ontology is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. WordNet distinguishes between nouns, verbs, adjectives and adverbs because they follow different grammatical rules. It does not include prepositions, determiners etc. Every synset contains a group of synonymous words or collocations.[8]

While semantic relations apply to all members of a synset, because they share a meaning and are mutual synonyms, words can also be connected to other words through lexical relations, including antonyms (opposites of each other) and derivationally related forms.

WordNet also provides the polysemy count of a word: the number of synsets that contain the word. If a word participates in several synsets (i.e. has several senses), then typically some senses are much more common than others. WordNet quantifies this with a frequency score: several sample texts have all their words semantically tagged with the corresponding synset, and a count is then provided indicating how often a word appears in a specific sense.

The morphology functions of the software distributed with the database try to deduce the lemma or root form of a word from the user's input; only the root form is stored in the database unless it has irregular inflected forms.

WordNet can be interpreted and used as a lexical ontology in the computer-science sense. However, such an ontology should normally be corrected before being used, since it contains hundreds of basic semantic inconsistencies, such as (i) the existence of common specializations for exclusive categories and (ii) redundancies in the specialization hierarchy. Furthermore, transforming WordNet into a lexical ontology usable for knowledge representation should normally also involve (i) distinguishing the specialization relations into subtype-of and instance-of relations, and (ii) associating intuitive unique identifiers with each category.

Calculating Semantic Similarity & Ranking Snippets

To calculate semantic similarity, an ontology must be specified first; WordNet, for example, is a very famous and widely used ontology. Based on WordNet, researchers have already put forward several semantic similarity formulas. For example, Leacock and Chodorow propose the following formula to compute the semantic similarity between two concepts [7]:

sim(π1, π2) = -log( len(π1, π2) / (2 · D) )

where len(π1, π2) is the length of the shortest path between concepts π1 and π2 in WordNet and D is the maximum depth of the taxonomy, i.e. the length of the longest path from a concept to the root.
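A minimal implementation of this measure might look as follows, assuming the shortest-path length between the two concepts and the taxonomy depth are already known:

```python
import math

def lch_similarity(path_len, depth):
    """Leacock-Chodorow similarity: -log(len / (2 * D)), where
    path_len is the shortest path between the two concepts and
    depth is the maximum depth of the taxonomy."""
    return -math.log(path_len / (2.0 * depth))
```

Note that a deeper taxonomy or a shorter path both yield a higher similarity score, as the formula intends.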

In this section, a systematic method is proposed to classify the results associated with keywords from Google into corresponding categories by automatically comparing their semantic similarity. Google results associated with the corresponding keywords are used as the input to this system.

A WordNet-based semantic similarity algorithm is proposed to calculate the similarity of these extracted words. A combined similarity for each snippet is calculated from the similarities of the words that make up the snippet. This combined similarity value is used as a factor for ranking the snippets: in each ranking, the snippets are ordered in descending sequence according to the similarity and other factors.
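The aggregation step can be sketched as follows. The averaging of best word matches and the toy `word_sim` function passed in below are illustrative assumptions, standing in for the WordNet-based measure:

```python
def snippet_similarity(snippet_words, query_words, word_sim):
    """Combine word-level similarities into one snippet score:
    for each query word take its best match in the snippet,
    then average over the query words (one plausible aggregation)."""
    if not query_words:
        return 0.0
    best = [max((word_sim(q, w) for w in snippet_words), default=0.0)
            for q in query_words]
    return sum(best) / len(query_words)

def rank_snippets(snippets, query_words, word_sim):
    """Order snippet indices by descending combined similarity,
    breaking ties by original position."""
    scored = [(snippet_similarity(s, query_words, word_sim), i)
              for i, s in enumerate(snippets)]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored]
```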

PROPOSED DESIGN

Proposed Approach

The approach consists of two different phases, as shown in Fig. 1 below.


Fig. 1: Re-ranking Method

As shown in Fig. 1, a query is first fired at a backend search engine such as Google, which returns results based on keyword matching. To improve the web search result, this paper presents a new approach. First, snippets are fetched from the top N results returned by the Google search engine. Stemming and part-of-speech tagging are performed on the snippets, and then a semantic similarity is calculated using an ontology such as WordNet. Based on the semantic similarity between each snippet and the query, and the importance score from the original ranking, a new re-ranking algorithm is applied that improves the quality of the search engine's results.

The main stages of the project are as follows:-

Perform a Pre-processing on retrieved information from search engine.

Calculate semantic similarity score between snippets and topics.

Calculate the importance score for each document.

Combine semantic similarity and importance score in a balanced method of ranking web pages.

Perform Semantic Ranking.
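The balanced combination stage above can be sketched as follows. The linear mix weighted by a coefficient `alpha` is an assumption for illustration, not necessarily the paper's exact formula:

```python
def balanced_rerank(results, alpha=0.5):
    """Re-rank results by a balanced score combining semantic
    similarity and the position-based importance score.
    `results` is a list of (doc_id, semantic_sim, importance)
    tuples; alpha weights similarity against importance."""
    def score(r):
        _, sim, imp = r
        return alpha * sim + (1 - alpha) * imp
    return [doc for doc, _, _ in sorted(results, key=score, reverse=True)]
```

Setting alpha to 0 reproduces the original importance-based order, while alpha = 1 ranks purely by semantic similarity.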

3.1.1 Performing Pre-processing on Information Retrieved from the Search Engine:-

The first phase is to retrieve the snippets from the Google results. From each paragraph, the system extracts one snippet in free text by eliminating the HTML tags. A snippet is defined as:

Snippet = paragraph (Title) + paragraph (Abstract).

First, stop words are removed by comparing the words in each snippet with those in a stop list. The stemming process and POS tagging are employed to disambiguate the forms and the appropriate meanings of the words in each snippet. The system utilizes TreeTagger, a language-independent POS tagger, to implement these tasks. The final outcome of this phase is a group of "processed snippets", each consisting of a set of extracted words.
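A simplified version of this phase might look like the following. The tiny stop list and the naive suffix stripper are illustrative stand-ins for a full stop list and for TreeTagger's stemming and tagging:

```python
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and"}

def naive_stem(word):
    """Very rough suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess_snippet(snippet):
    """Lowercase, strip punctuation, drop stop words, and stem:
    yields the extracted words that feed the similarity step."""
    words = [w.strip(".,!?;:\"'()") for w in snippet.lower().split()]
    return [naive_stem(w) for w in words if w and w not in STOP_WORDS]
```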

Calculate semantic similarity between snippets and Query

In order to re-rank the snippets by relevance to the user's query, an algorithm is proposed to calculate the similarity between the snippets and the topics. Traditional semantic similarity calculation methods use only the word distance, but this paper presents two important factors. The first factor is based on the path between concepts in WordNet. The second is based on the information content of concepts, which comes from an entropy measurement. Moreover, a coefficient is employed to balance the weight of the two factors for accuracy revision.

We utilize the logarithm of the concept distance, which refers to the number of nodes between two words, to reflect the change trend of the similarity, together with the density of the two words on the semantic path. The corresponding equation based on these for terms T1 and T2 is as follows:

where Depth(T) is the number of levels of the term counting from the root node "Entity"; the maximum number of levels in the taxonomy of WordNet is also used; and the number of common hyponym nodes between the two terms (including themselves) is counted, where these nodes are the closest common parent nodes in WordNet.
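Since the paper's exact equation is not reproduced in this text, the following sketch shows one plausible instantiation of a two-factor measure of this kind. The logarithmic path decay and the Wu-Palmer-style depth factor are assumptions, not the authors' formula:

```python
import math

def term_similarity(dist, depth1, depth2, lcs_depth, alpha=0.5):
    """A hypothetical two-factor term similarity: a path factor
    that decays with the log of the node distance, and a depth
    factor (in the spirit of Wu-Palmer) that rewards a deep
    closest common parent. alpha balances the two factors."""
    path_factor = 1.0 / (1.0 + math.log(1 + dist))
    depth_factor = (2.0 * lcs_depth) / (depth1 + depth2)
    return alpha * path_factor + (1 - alpha) * depth_factor
```

Identical terms (zero distance, common parent at their own depth) score 1.0, and the score falls as distance grows or the common parent moves toward the root, matching the intuition described above.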

Calculating The Importance Score for each document:-

Since this paper depends heavily on the search engine's quality and results, how to grade the web pages returned from the search engine is important, so we need criteria to measure each web page. Before introducing our method, we would like to present one important concept, DCG, short for Discounted Cumulative Gain, a common way to evaluate search result quality. The expression (3) for DCG is as follows:

DCG_p = rel_1 + Σ_{i=2}^{p} rel_i / log2(i)     (3)

where p is the rank position and rel_i is the graded relevance of the result at position i. This expression has a discount factor, which means that the lower the rank, the smaller the share of the document's value that is added to the cumulated gain. The lower the ranked position of a relevant document, of any relevance level, the less valuable it is for the user, because it is less likely that the user will examine the document, due to time, effort, and information already accumulated from documents seen earlier.

Let us return to how to measure the importance of the results at different positions. As we know, the search results are returned by search engines according to their importance and relevance. The most important web pages are usually returned at the top positions, and hence attract much more attention from users; on the contrary, the unimportant pages are returned at the bottom positions. Therefore, a discount factor is needed which progressively reduces the document's value as its rank decreases. The following formula is proposed to calculate each web page's importance score:

where i is the original rank (PageRank serial number) and tot is the number of fetched web pages for the query.
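The exact formula is not reproduced in this text; one plausible position-based importance score of this kind, using a logarithmic discount normalized so that the top result scores 1.0, might be:

```python
import math

def importance_score(i, tot):
    """A hypothetical position-based importance score (the exact
    formula is not reproduced here): discount the document value
    logarithmically with 1-based rank i, normalized so the top
    result among the tot fetched pages scores 1.0."""
    return math.log2(tot - i + 2) / math.log2(tot + 1)
```

Any monotonically decreasing, normalized function of the rank would serve the same purpose in the balanced re-ranking step.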

CONCLUSION

This paper presents an approach for improving web search results. Ontology similarity is unquestionably important for a Semantic Web search engine. The paper proposes an ontology-similarity-based approach to measure the similarity between the user's query and a web page. It attempts to provide a solution for information retrieval that dynamically retrieves higher occurrences of the concepts within web pages, which reduces the effort the user must make to find the required concept. A semantic re-ranking algorithm is proposed that re-ranks the original search results based on the importance score and the semantic similarity. This proposed re-ranking algorithm improves the search engine's results.


