Contemporary Approach For Information Retrieval Computer Science Essay

02 Nov 2017


1Department of CSE, Anna University of Technology, Coimbatore, India

2Department of Information Technology, Sona College of Technology, Salem, India

ABSTRACT

Web mining has stretched its boundaries to mining large data repositories in order to generate accurate responses to user-defined queries. Because web data lack a homogeneous structure, web mining has had to evolve beyond conventional techniques. Unlike classical data mining and text mining, web content mining must cope with the volume of data on the web, which grows rapidly day by day; the significant issue in web content mining is therefore the critical decision-making process. Previous research suffers from a prime shortcoming: only the top two ranked sentences are considered for information retrieval, and the resulting redundancy keeps clustering accuracy at a bare minimum. In our proposed work, a Semantic Based Proximity Enclosure (SBPE) mechanism first calculates a semantic proximity score that predicts several document constraints, then constructs a document resemblance matrix, and finally clusters the documents with the RMBC (Resemblance Matrix Based Clustering) algorithm according to the generated resemblance matrix. User-relevant documents are then retrieved through a query weighting methodology. Simulation results show that the proposed mechanism performs effectively on large document sets and also yields superior results in terms of accuracy.

Keywords: Information Retrieval, Web mining, Text Clustering, PoS Tagging, Stop-Word Elimination, Resemblance Matrix, Query Weighting, Proximal Enclosure

1. INTRODUCTION

The web is a vast, redundant collection of documents. Online repositories are growing rapidly, and as a result the information returned by web search engines tends to make browsing monotonous. The web offers stupendous opportunities and challenges for data mining.

Most data on the web are linked to other data: many hyperlinks run within and across sites, and redundant information may appear on the same or different pages. A great deal of malicious activity also takes place on the web every day, and there are plenty of challenges in monitoring website parameters; updating and monitoring the data is another crucial issue. For instance, examining a website can answer several pertinent questions, such as who the users are, what they looked at, and when their interests changed. These are significant questions to be answered in Customer Relationship Management (CRM).

Broadly speaking, organizations give priority to fulfilling their customers' desires. If customer needs are known early, companies can react to them faster. The insight acquired into customer requirements can be used to minimize production cost, and profitability can be increased by adjusting pricing based on the collected profiles.

In this study we focus on web content mining to build user-relevant profiles from web documents collected initially. Preprocessing eliminates unnecessary terms and HTML tags. The documents are then clustered with the Semantic Based Proximity Enclosure mechanism, which works in concert with clustering and query weighting methodologies. First, the document resemblance matrix is computed with a Bayesian-approach-based technique; clustering then follows using the Resemblance Matrix Based Clustering (RMBC) algorithm. The study concludes with a query weighting mechanism and the ranking of documents.

The remainder of the study is organized as follows. Section 2 reviews related work. Section 3 presents the proposed system. Experimental results are described in Section 4. Finally, Section 5 concludes the study.

2. RELATED WORK

Web document clustering is an established method for navigating and browsing large document collections and for organizing the results returned by search engines in response to user queries. Although many effective clustering algorithms exist, an important open issue is feature selection: deciding which attributes of the data should be used by the clustering algorithm. Feature selection for clustering is difficult because class labels are absent. Zeng et al. [9] introduced a novel feature selection method for Gaussian mixture clustering that eliminates redundant features using a Markov blanket filter. The method extracts the smallest subset of relevant features that adequately represents the data partitions. First, an index measuring feature relevance is applied; second, the Markov blanket filter, adapted to unsupervised learning, removes redundant features. This form of feature selection works well when integrated into the rival penalized expectation-maximization (RPEM) clustering algorithm.

Feature selection for clustering is also difficult because, unlike in supervised learning, the data carry no class labels, so there is no obvious criterion to guide the search. With small clusters, maximizing the speed of convergence is critical. In the Gaussian mixture model, the convergence speed of the Expectation Maximization (EM) algorithm depends on the overlap of the mixture components. The anti-annealing algorithm introduced by Naim et al. [10] significantly enhances the convergence speed of EM; it exhibits globally robust convergence on data sets with small clusters and requires no line search.

Zhang et al. [2] cluster documents using frequent patterns. They first review document clustering based on frequent patterns and then propose the MC approach to generate document clusters. The key idea is to constrain document pairs with high similarity to the same cluster. To produce the clusters, the study introduces an enhanced minimum spanning tree algorithm and uses the most frequent item set of a cluster as its topic. An alternative approach by Huang et al. [14] clusters documents with similarity measures such as Euclidean distance and relative entropy, since document clustering gains accuracy when the similarity between pairs of objects is considered.

Unsupervised feature selection algorithms generally select features in a global sense, which is inadequate when the data set exhibits strong local structure. Li et al. [6] addressed this issue with localized feature selection per cluster: a cross-projection method estimates the optimal feature subset for each cluster, and a Bayesian framework computes the feature saliency for each cluster. Song et al. [5] provide query-based association rule mining over submitted queries. Pratap et al. [7] tackle clustering problems with an efficient density-based k-medoids algorithm, which effectively overcomes shortcomings of existing clustering algorithms such as DBSCAN and k-medoids.

Early approaches to automatic verb classification depended on manual classification of semantic features. Sun et al. [12] enhance this work by acquiring semantic features automatically. Shehata et al. [13] proposed a novel framework for verb clustering that assimilates a subcategorization acquisition system; it performs well only in high-dimensional feature spaces. A technique combining concept-based term analysis with similarity measures provides an efficient mechanism for text clustering. Unsupervised feature selection is well suited to data sets of large dimensionality and volume. Shamsinejadbabki et al. [17] proposed a novel unsupervised feature selection method for text clustering that uses a genetic algorithm to identify the most valuable groups of terms; the extracted terms then form the final feature vector for the clustering process. Shailendra Kumar et al. [3] proposed another unsupervised approach to document clustering that deploys affinity propagation (AP) to cluster text documents based on phrases and the vector space model.

Meanwhile, Li et al. [8] introduced two clustering algorithms, Clustering based on Frequent Word Meaning Sequences (CFWMS) and Clustering based on Frequent Word Sequences (CFWS), which exploit word sequences within documents.

The Self-Organizing Map (SOM) is a neural network that is very useful for visualizing high-dimensional data in knowledge mining. Silva et al. [15] developed a generic method for feature clustering with SOMs; applying the neural network to the input data set enhances the visualization capabilities of the SOM. Liu et al. [11] aid users in analyzing large text collections: an LDA-based topic analysis technique summarizes a document collection by automatically deriving a set of topics, and an interaction tool lets the user visualize the collection from multiple perspectives. Kurian et al. [1] cluster documents based on short contextual information within each document, using SOM, an unsupervised learning method, to map sources to clusters according to text context.

Information retrieval is a vast area dealing with document search based on a query. Pembe et al. [16] presented a novel summarization approach for web search based on structure-preserving, query-biased document summarization. The framework has two parts: the first processes web documents with a rule-based approach, and the second uses the document structure obtained by the first to devise an automatic summarization method.

Query expansion is another concept used to improve the performance of the information retrieval process. Carpineto et al. [4] presented a survey of automatic query expansion based on query logs, in which correlations were established between terms from speech transcripts and visual concepts.

The Classification-Expectation-Maximization (CEM) algorithm is a variant of the well-known EM algorithm in which a classification step is performed between each expectation and maximization step. In this study, CEM serves as the baseline against which the throughput thresholds of the proposed system are compared, and cluster-based query weighting is used to capture the relevant documents.

3. PROPOSED MECHANISM

Web mining refers to the general process of discovering previously unknown and potentially useful information or knowledge from web data. In web mining, preprocessing is an essential step for extracting the intrinsic structure of a web document.

3.1. Preprocessing

This is the initial step of the proposed mechanism, alternatively called feature extraction. The web documents are represented in the vector space model, in which terms (words) are extracted from the corpus. Too much information can drastically reduce the effectiveness of data mining: only some of the attribute columns are useful for building and testing, and an excess of attributes may actually detract from quality and accuracy.
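As a minimal illustration of the vector space model mentioned above, each document can be represented as a term-frequency vector. The paper does not specify its exact weighting, so plain term frequency is assumed here:

```python
from collections import Counter

def to_vector(text):
    # Lowercase and split on whitespace; a real tokenizer would do more.
    return Counter(text.lower().split())

docs = ["Web mining mines web data", "text mining of documents"]
vectors = [to_vector(d) for d in docs]
# vectors[0]["web"] is 2: the term appears twice in the first document.
```

Each vector component counts a term's occurrences; downstream steps can then weight or normalize these counts.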

3.2. Feature Selection

Irrelevant lexicons add noise to the data and hurt accuracy. Noise also inflates the size of the model and the time and system resources needed to build it. Moreover, data sets may contain correlated groups of attributes that actually measure the same underlying feature; their joint presence in the data can skew the logic of the algorithm and affect the accuracy of the model. Wide data (many attributes) generally present processing challenges for data mining algorithms: the dimensionality of the processing space depends on the model attributes, and the higher the dimensionality, the larger the computational cost of algorithmic processing. Mitigating the effects of noise, correlation, and high dimensionality calls for some form of dimensionality reduction, often a desirable preprocessing step for data mining. Feature selection and feature extraction are the two principal mechanisms for dimensionality reduction.

The prime concern in extracting the intrinsic terms is the elimination of HTML tags. Web documents are typically marked up with HTML tags such as paragraphs, hyperlinks, and meta tags, and scripting languages such as VBScript and JavaScript may also be embedded in a web document. These documents must therefore be parsed with a parser, after which the document contains only the skeleton of the text.
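A minimal sketch of this parsing step, using Python's standard `html.parser` to discard tags and embedded scripts (the study does not name the parser it uses, so this is purely illustrative):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keeps only the text content of a web page, dropping tags and scripts."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0  # depth inside <script>/<style>, whose content is not prose

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

def strip_html(doc):
    parser = TextExtractor()
    parser.feed(doc)
    return " ".join(parser.parts)

# strip_html('<p>Web <a href="#">mining</a></p><script>var x;</script>')
# returns 'Web mining'
```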

3.3. Parts-Of-Speech (POS) Tagging

Part-of-speech tagging assigns a part of speech, such as noun, verb, pronoun, preposition, adverb, adjective, or another lexical class marker, to each word in a sentence. The input is the string of words of a natural-language sentence together with a specified tag set (a finite list of part-of-speech tags); the output is the single best POS tag for each word.
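Real taggers select tags from sentence context (e.g., HMM or neural models); as a purely illustrative stand-in, a tiny lexicon-based tagger with hypothetical entries shows the input/output shape described above:

```python
# A toy lexicon; real taggers (HMM, neural) disambiguate using context.
LEXICON = {"the": "DET", "dog": "NOUN", "runs": "VERB", "fast": "ADV"}

def pos_tag(sentence):
    # Unknown words default to NOUN, the most common open-class tag.
    return [(w, LEXICON.get(w.lower(), "NOUN")) for w in sentence.split()]

# pos_tag("The dog runs fast")
# → [('The', 'DET'), ('dog', 'NOUN'), ('runs', 'VERB'), ('fast', 'ADV')]
```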

3.4. Stop-Word Elimination

Stop-word elimination removes high-frequency function words (such as "the", "is", and "of") that carry little discriminative value; the remaining words are then stemmed. Stemming is the procedure of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form. The stem need not be identical to the morphological root; it is usually sufficient that related words map to the same stem, even if that stem is not itself a valid root. In this study the Porter algorithm is applied for stemming.
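A simplified sketch of these two steps; the stop-word list and suffix table below are illustrative stand-ins, not the full Porter algorithm used in the study:

```python
# Illustrative stop-word list; production systems use larger curated lists.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and"}

# A much-simplified suffix stripper; the full Porter algorithm applies
# several ordered rule steps with measure conditions.
SUFFIXES = ("ization", "ational", "ation", "ing", "ed", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        # Only strip when a reasonable stem (3+ characters) remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(tokens):
    # Drop stop words, then stem what remains.
    return [stem(t) for t in tokens if t not in STOP_WORDS]

# normalize(["the", "clustering", "of", "documents"]) → ["cluster", "document"]
```

Note that, as with Porter stemming, the output stems ("cluster", "document") need not be dictionary words; they only need to be consistent across related forms.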

3.5. Semantic Based Proximity Enclosure Mechanism (SBPE)

The preprocessed data then enter the clustering stage. Document clustering here is a centralized process in which the SBPE mechanism is performed. The mechanism consists of four phases. The first phase formulates the proximity score; the score constructed in this phase then feeds the similarity calculation. The second phase computes the resemblance matrix: a Bayesian-approach-based method is applied to find the resemblance between documents. The third phase forms clusters using the RMBC algorithm on the resemblance matrix constructed in the second phase. Finally, the weight of each cluster is calculated so that relevant information can be retrieved during the search process, and the retrieved documents are ranked according to the query weight.

3.6. Proximal Score Formulation (PSF)

In the SBPE mechanism, the data are preprocessed first, and the PSF computation follows, producing a non-negative score. This non-negative score of each document is calculated by Equation (3), whose value is obtained from Equations (1) and (2). Before calculating the score, the tf (term frequency) and Qi (the number of occurrences of a word) must be evaluated for each document. The proximal score is computed as follows:

(1)

(2)

(3)

where:

N = the number of words in the document
µi = the frequency value of a word
wi = the weight of each word
xi = the proximal score of each document
tfi = the number of occurrences of a word in that document
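Since Equations (1)-(3) are not reproduced in this copy, the sketch below only assumes a plausible form consistent with the variable list: per-term weights wi multiplied by term frequencies tfi, summed and normalized by the document length N so the score stays non-negative. The exact formulation in the paper may differ:

```python
from collections import Counter

def proximal_score(tokens, weights):
    """Hypothetical proximal score: weighted term frequencies summed over
    the document and normalized by its length N (non-negative by design)."""
    n = len(tokens)
    if n == 0:
        return 0.0
    tf = Counter(tokens)  # tf_i: occurrences of each word in the document
    # Terms missing from the weight table get a neutral weight of 1.0.
    return sum(weights.get(term, 1.0) * freq for term, freq in tf.items()) / n
```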

Table 1. (a) Resemblance Matrix

       D1      D2      D3      D4      D5
D1     -       0.34    0.421   0.632   0.32
D2     0.221   -       0.891   0.321   0.12
D3     0.912   0.454   -       0.478   0.465
D4     0.367   0.765   0.274   -       0.098
D5     0.987   0.876   0.231   0.324   -

Table 1. (b) Binary Matrix

D1 D2 D3 D4 D5

D1 - 0 0 1 0

D2 0 - 1 0 0

D3 1 0 - 0 0

D4 0 1 0 - 0

D5 1 1 0 0 -

The proximal score calculated above is taken as the weight of each document. The similarity between documents is then measured during resemblance matrix construction.

3.7. Resemblance Matrix Construction

Resemblance matrix construction finds the similarity between documents; the Bayesian approach is used here for its formulation. The Bayesian approach is an important technique, grounded in probability theory, that provides an optimal result. The documents are processed with this approach to formulate the resemblance matrix. The general form of the Bayesian approach is:

P(A|B) = P(B|A) P(A) / P(B)

where P(A|B) denotes the similarity between the two probability measures. This is applied to evaluate document similarity according to Algorithm 1:

Algorithm 1. Construction of Resemblance Matrix

where i and j are two documents and S(i) and S(j) are their scores. The process is iterated over all n documents, and the result forms the resemblance matrix m[i, j].
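The body of Algorithm 1 is likewise not reproduced here; the sketch below substitutes a simple score-ratio measure (the smaller of the two document scores over the larger, giving values in [0, 1]) as a hypothetical stand-in for the paper's Bayesian formulation:

```python
def resemblance_matrix(scores):
    """Builds an n-by-n resemblance matrix m[i][j] from per-document
    scores S(i). The measure min(S_i, S_j) / max(S_i, S_j) is an assumed
    stand-in for the paper's Bayesian formulation; it lies in [0, 1]."""
    n = len(scores)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            larger = max(scores[i], scores[j])
            if i != j and larger > 0:
                m[i][j] = min(scores[i], scores[j]) / larger
    return m

# resemblance_matrix([2.0, 4.0]) → [[0.0, 0.5], [0.5, 0.0]]
```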

3.8. RMBC Algorithm

Clustering is the process of grouping data with respect to some constraint: data within a cluster are similar to one another, while dissimilar to data outside the cluster. Clustering techniques differ depending on the purpose of clustering and the nature of the data. In our approach, the Resemblance Matrix Based Clustering algorithm clusters the documents using the matrix produced by the resemblance matrix construction technique. Fig. 1 shows the proposed system architecture.

The proposed clustering algorithm has two phases. The first converts the similarity matrix into a binary matrix, as shown in Tables 1a and 1b; the second clusters the documents with respect to the binary matrix. The conversion applies the threshold rule Bij = 1 if RVij ≥ β and Bij = 0 otherwise, where β = 0.5 and RV is the resemblance value. Once the binary matrix is constructed, the clustering of documents begins. The documents in each row are compared with the other documents according to the binary value Bij, where i denotes the row and j the column. Documents whose Bij value is 1 are grouped together, while individual clusters are formed for documents whose Bij values are 0. The number of clusters varies dynamically with the number of documents, so the cluster count cannot be predicted in advance. Documents enclosed in one cluster are not allowed to enter a new cluster. For example, consider five documents D1 to D5: calculate the proximal score using Equation (3) and compute the resemblance values shown in Table 1a.

Consider Table 1b, the binary matrix B. Document D1 is compared with the other documents with respect to their binary values. D4 has the value 1, so D1 and D4 are clustered together; once clustered, they may not enter another cluster. Likewise, D2 is compared with D3 and D5; only D3 has the value 1, so D2 and D3 are clustered. D5 forms an individual cluster. These steps are iterated until all documents are clustered.
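The worked example above can be reproduced with a short sketch: binarize the resemblance matrix at β = 0.5, then sweep the rows, grouping each still-unclustered document with the unclustered documents whose binary entry is 1:

```python
def rmbc(resemblance, beta=0.5):
    """Resemblance Matrix Based Clustering: binarize at beta, then sweep the
    rows; a document already placed in a cluster never joins a second one."""
    n = len(resemblance)
    binary = [[1 if resemblance[i][j] >= beta else 0 for j in range(n)]
              for i in range(n)]
    clustered = [False] * n
    clusters = []
    for i in range(n):
        if clustered[i]:
            continue
        group = [i]
        clustered[i] = True
        for j in range(n):
            if not clustered[j] and binary[i][j]:
                group.append(j)
                clustered[j] = True
        clusters.append(group)
    return clusters

# Resemblance values of Table 1a (documents D1..D5, diagonal set to 0).
R = [
    [0.0,   0.34,  0.421, 0.632, 0.32],
    [0.221, 0.0,   0.891, 0.321, 0.12],
    [0.912, 0.454, 0.0,   0.478, 0.465],
    [0.367, 0.765, 0.274, 0.0,   0.098],
    [0.987, 0.876, 0.231, 0.324, 0.0],
]
# rmbc(R) → [[0, 3], [1, 2], [4]], i.e., {D1, D4}, {D2, D3}, and {D5} alone.
```

D5 ends up alone even though its row contains 1s, because D1 and D2 are already clustered by the time its row is swept, matching the behavior described in the text.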

3.9. Query Weighting Based On Clusters

Query weighting is the process by which a search engine adds weighted search terms to a user's query. The goal of query weighting in the SBPE mechanism is to improve precision and/or recall.

Fig. 1. System architecture

Fig. 2. Comparison [Computing Time] between CEM and SBPE mechanisms

Information retrieval is a vital process on the web, and the amount of data on the web is always increasing: a 1999 survey reported that Google indexed 135 million pages; it now indexes over 3 billion. Search engines follow specific mechanisms in their searches. In this study, information retrieval is carried out with the weighted querying mechanism.

The query, say "Y", is given by the user. After preprocessing the query, only its intrinsic terms are extracted. The search is then performed over the documents in the clusters to find "Y". The variable b denotes the total number of documents containing the query term; a denotes the number of query terms present in each document of each cluster, and c the number of occurrences of each query term in that cluster. The score Skj is then computed for all documents in each cluster, and the scores of the documents within a single cluster are summed to obtain a value for each cluster, from which a probability metric is computed for all clusters. Finally, the cluster with the largest score is selected, the query term is compared with the documents in that cluster, and the documents are ranked accordingly. The pseudocode for this process is given in Algorithm 2:

Algorithm 2. Query weighting in SBPE mechanism

where the probability metric is calculated as:
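Since the exact Skj score and probability metric are not reproduced here, the sketch below substitutes a simple count of query-term occurrences per document, summed per cluster; the best-scoring cluster's documents are then ranked. This is a hypothetical stand-in for Algorithm 2, not the paper's exact formulation:

```python
def rank_by_query(clusters, query_terms):
    """clusters: list of clusters, each a list of documents (token lists).
    Score each document by its count of query-term occurrences, pick the
    cluster with the largest summed score, and return that cluster's
    documents ranked by their individual scores."""
    terms = set(query_terms)
    best_pairs, best_total = [], -1
    for cluster in clusters:
        pairs = [(doc, sum(tok in terms for tok in doc)) for doc in cluster]
        total = sum(score for _, score in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    # Highest-scoring documents of the winning cluster come first.
    return [doc for doc, _ in sorted(best_pairs, key=lambda p: -p[1])]

# With clusters [[["web","mining"], ["web","web"]], [["text"]]] and the
# query ["web"], the first cluster wins and ["web","web"] ranks first.
```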

4. EXPERIMENTAL RESULTS

The experimental results of the proposed system are presented here to illustrate the efficacy of our aggregation mechanism. The data set used in our experiments is the benchmark Reuters data set, which is also lightly trained for query processing. Feature selection is first performed using an HTML parser; the PoS tagger and stop-word elimination are then applied, and significant words are extracted through the stemming process.

The word patterns are sorted by term frequency. Clustering with the RMBC algorithm yields a framework of clusters in which correlated documents are grouped together. The query weighting scheme is then applied for information retrieval. When performing query weighting, the decisive-factor value is kept at 0.2, and the results are estimated for this value; it is preferable to keep the decisive factor below 0.25. In future work, other algorithms will be considered to synchronize the clustering process with the similarity evaluations.

CEM is taken as the baseline for comparing the results of our proposed mechanism. The experimental results reveal reduced computing time for the proposed SBPE mechanism across the data sets, as depicted in Fig. 2: five data sets are processed, and for all of them the computing time is minimized.

For data set 1, CEM takes 0.38 whereas SBPE takes 0.28; likewise, for data set 3, SBPE takes 0.32 against 0.43 for CEM. The comparison shows that, relative to the existing CEM approach, the SBPE mechanism achieves a considerable decrease in computing time. Fig. 3 plots the accuracy rate of the proposed SBPE approach against CEM; the accuracy rate of SBPE is markedly higher than that of CEM.

Fig. 4 compares the memory usage of CEM and SBPE. For processing the 1000 records of data set 1, CEM requires 20% of memory while SBPE consumes 13%; likewise, for 2000 records, CEM needs 25% while 17% suffices for SBPE. Memory consumption is thus lower with the SBPE mechanism.

Fig. 5 shows the number of clusters formed against the number of documents dispatched: 8 clusters are formed for 1000 documents and 9 clusters for 2000 documents. The clusters are framed by applying the RMBC algorithm, and the clustering accuracy achieves good optimization. Furthermore, the mechanism is economical in its memory consumption, and good clusters are likewise achieved.

Fig. 3. Comparison [Accuracy] between CEM and SBPE mechanisms

Fig. 4. Comparison [Memory Usage] between CEM and SBPE mechanisms

Fig. 5. Dataset Vs clusters

Fig. 6. Dataset Vs clustering accuracy

Fig. 7. Query Vs retrieval time

Fig. 6 shows that for 1000 documents the accuracy reaches 85%, while for 2000 documents it is 78%, and similarly for the other data sets. Corresponding observations are made for the queries and their retrieval times: from Fig. 7 it is observed that the least retrieval time is found for query 5.

Thus a trained framework of clusters is acquired, which is beneficial for information retrieval. Time is saved because the search concentrates only on the relevant cluster instead of the entire collection; only the documents belonging to that cluster are subjected to further processing.

5. CONCLUSION

This study provides a merging path between clustering and information retrieval; the resulting mechanism is therefore called an aggregation mechanism. At the outset, the mechanism affords a fine clustering framework, which is then applied to information retrieval, where the first N documents matching the query are retrieved by the cluster-based query weighting approach. Simulation results show that the SBPE mechanism conserves both time and memory, and that information retrieval accuracy also increases. Thus the proposed SBPE mechanism successfully clusters and ranks documents based on its weighting approach. In future work, this mechanism will be further scrutinized regarding its accuracy and system constraints such as memory and time. Another direction is to extend the mechanism to ontology indexing combined with semantic relations between terms.


