Data Mining To Improve The Efficiency

Abstract: It is a significant challenge to search, comprehend and use the semi-structured HTML, XML and database-service-engine information stored on the web. This data is more complex and dynamic than the data in commercial databases. Data mining has been applied to web-page ranking to supplement keyword-based indexing. In this context, data mining improves the quality and efficiency of search results. For the web to be at its best, we must improve its usability. Data mining can play a vital role in the development of an intelligent web, making it a more exhaustive, intuitive, intelligent and usable resource. The paper shows how data mining can be applied to discover and catalog important links and patterns that will make our web interactions directed and intelligent.

Introduction: The web, an enormous and dynamic repository of pages with unquantifiable hyperlinks and a colossal amount of access and usage statistics, provides an exhaustive and unique source of material for data mining techniques. However, several problems that obstruct useful resource and knowledge discovery must also be considered:

The sheer structural and design intricacy of the large databases associated with web pages far exceeds that of conventional repositories of text files. Non-uniform structural layouts, authoring styles and variety in content are intrinsic to web pages. In addition, the lack of indexing complicates the task of searching the data efficiently.

The information stored in web databases is highly volatile and dynamic, in terms of both content additions and updates and of linkage structure and access records, e.g., news, sports and stock market web sites.

Internet users differ greatly in their usage, interests and backgrounds: many lack knowledge of the structural complexities of information networks, searches incur heavy costs, and users are easily swayed by the huge amount of information the web database has to offer.

Truly relevant results form only a very small fraction of the total information the web offers, or of what a search digs up, so the entire process is bogged down by a flood of unimportant data.

The question remains – "How can a search find truly relevant, concentrated information and of distilled quality?"

As of today, there are primarily three ways in which a user can access web information:

Keyword-based: Uses keyword indexing or manual directories to search for documents.

Deep querying: Used where a large catalog of information or data is stored and is hidden behind database query forms, otherwise inaccessible through non-dynamic URL links, e.g., amazon.com

Random surfing of web pages through hyperlinks.

The above techniques have succeeded considerably and this fact goes on to indicate the great potential of the internet to become the most exhaustive information repository.

Concerns in Design: Realizing an intelligent Web has been a major research concern, and achieving it requires overcoming two intrinsic problems:

The conventional methodologies for accessing the extremely voluminous data on the Web inherently adopt the keyword-based view at the level of abstraction.

Replacement of prevailing crude access methodologies by more intuitive versions must be implemented at the service level.

Confines in Access: Data mining will play a very important role in realizing an intelligent Web because, despite the support for keyword-, address- and topic-based searches, the Web in its present form is unable to provide services that are both high quality and intelligent. The factors that have steered the inspiration behind this research are:

Inefficient searches implementing the keyword-based algorithm, which are tainted with problems like

Returning an unreasonably large and ineffective amount of information if the keywords belong to popular categories like sports, etc.

Returning poor results in case of semantic overload and context-based discrepancies.

Missing highly relevant and effective results just because a particular keyword is absent, even though the article is extremely informative in the searched context.

Mediocre quality and effectiveness in deep Web-querying despite the presence of well-structured, well-designed and richly populated databases, due to the inability of web crawlers to query them. This ultimately leaves relevant information perpetually invisible. To overcome this, we must integrate the heterogeneous databases, each with its own query support, that comprise the hugely informative Web.

Manually constructed directories that are topic-oriented or type-based are highly appropriate, as they represent a portion of the Web in an organized manner and also support searches based on semantics. This leads to efficient searching, but such directories are costly to build, provide only limited coverage and suffer in both scalability and adaptability.

Absence of semantic queries, which developers try to substitute with keyword-based algorithms augmented with features like "match the exact phrase", "match all the words" or "match only a single word"; even then, these do not make up for the inefficiency caused by the lack of true semantic queries.

Feedback from users' activities and usage statistics helps enormously in observing collective behaviour to identify authentic, authoritative and high-quality web pages. However, as human activities are extremely temporal, the links in the Web must also be dynamic enough to support the flow.

Predominantly keyword-based searching offers limited support for multidimensional analysis and data mining, and consequently curbs operations such as running a query to list data mining centres, focusing on those with a high number of papers and analysing their changes.

The above confines have been a great incentive for researchers to develop ways to efficiently, effectively and accurately mine and discover internet resources.

Tasks in Mining for Efficient Knowledge Discovery: Discussed below are the research problems that must be solved in order to develop an effective, intelligent Web.

Dredging search-engine data: An index-based search engine crawls the pages, arranges them in indices, and constructs and stores enormous keyword-based indices that help locate web pages containing the specified keywords. Hence, by using highly specific and conditional keywords, one can quickly and easily locate relevant results, as the sketch below illustrates.
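The following is a minimal sketch of that keyword-indexing idea: an inverted index maps each keyword to the pages containing it, and a conjunctive lookup returns the pages containing all query keywords. The toy pages and simple whitespace tokenization are illustrative assumptions, not a description of any particular engine.

```python
from collections import defaultdict

# Toy corpus standing in for crawled web pages (illustrative only).
pages = {
    "p1": "data mining improves search engine result quality",
    "p2": "jaguar car reviews and pricing",
    "p3": "the jaguar is a large cat native to the americas",
}

def build_inverted_index(docs):
    """Map each keyword to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(page_id)
    return index

def keyword_search(index, *keywords):
    """Return pages containing all of the given keywords."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

index = build_inverted_index(pages)
print(keyword_search(index, "jaguar"))          # {'p2', 'p3'}
print(keyword_search(index, "jaguar", "car"))   # {'p2'}
```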

However, there are some concerning limitations inherent in the above-mentioned approach:

A specific topic may have innumerable document entries containing the keyword(s), which may lead to a large number of returned results, most of which are quite irrelevant or poor in quality.

There might be many documents which don't contain the keywords that specifically define the searched topic. This may lead to omission or rejection of search results that would otherwise have been useful and quite relevant to the user.

As a result, we propose to integrate data mining techniques with the search engine in order to improve the quality of search results. As a first step in this direction, we can expand the keyword set to accommodate synonyms and contextual partners. For example, a search for the term "Jaguar" can be expanded to include "Jaguar car" and "Jaguar animal", and a search for "data mining" may include "data dredging" or "knowledge discovery". This gives a larger result set of documents, which can then be filtered for highly relevant results before being returned to the user; a sketch of this expansion follows. An analysis of web linkages and dynamics thus provides a strong foundation for high-quality knowledge discovery.
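Below is a minimal sketch of this query-expansion step under stated assumptions: the synonym table is hand-built for illustration (a real system might derive expansions from a thesaurus or from term co-occurrence statistics), and the base search function is stubbed.

```python
# Hand-built synonym/context table (illustrative assumption); a real system
# might derive these expansions from a thesaurus or co-occurrence statistics.
EXPANSIONS = {
    "data mining": ["data dredging", "knowledge discovery"],
    "jaguar": ["jaguar car", "jaguar animal"],
}

def expand_query(query):
    """Return the original query plus its contextual variants."""
    return [query] + EXPANSIONS.get(query.lower(), [])

def expanded_search(base_search, query):
    """Run the base search for every variant and merge the result sets."""
    results = set()
    for variant in expand_query(query):
        results |= base_search(variant)
    return results

# Usage with a stubbed base search that only knows one phrase:
stub = lambda q: {"doc-42"} if q == "knowledge discovery" else set()
print(expanded_search(stub, "data mining"))  # {'doc-42'}
```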

Web linkage framework analysis: The results returned to the user must not only be topic-specific but also authoritative and of high quality. A direct measure of the authoritativeness of web pages can be obtained by analyzing web page linkage frameworks. Hyperlinks in web pages embody a large amount of implicit human comment and annotation that helps in automatically deducing authority. For example, if a web page author creates a hyperlink on his/her page pointing to another page, it can be considered an endorsement or a certificate of the authenticity of that page. In this way, the recursive and collective framework of hyperlinks can lead us to authoritative web pages. But this notion poses some problems, as mentioned below:

Not every hyperlink can be considered as an endorsement as it might have been created for other purposes like advertising, navigation, etc.

Generally, the owner of a web page will not choose to put a hyperlink to his/her competitor’s web page on his/her page despite its authenticity.

Highly authoritative web pages often lack descriptive detail about themselves.

Here an important concept called "hubs" comes in. This term refers to a web page or a set of web pages that contain links to authorities. Although a hub itself may not be prominent, it provides links to relevant sites on a given topic. A hub can comprise either the links recommended on individual pages or links catalogued by a third-party site. Hubs help to find authoritative web pages: if a web page is pointed to by many good hubs, it is quite authoritative, and if a good hub points to web pages, those web pages are likely authoritative. This mutual reinforcement enables the mining of authoritative web pages and high-quality knowledge discovery. Direct consequences and practical examples of identifying hubs and authoritative web pages are the "PageRank" and "HITS" algorithms; a sketch of the HITS iteration follows.
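The following is a minimal sketch of the HITS-style mutual reinforcement between hub and authority scores, run on a toy link graph that is purely illustrative; the iteration count and normalization are conventional choices, not a definitive implementation.

```python
import math

# Toy link graph: page -> pages it links to (illustrative assumption).
graph = {
    "hub1": ["authA", "authB"],
    "hub2": ["authA", "authB", "authC"],
    "authA": [],
    "authB": ["authA"],
    "authC": [],
}

def hits(graph, iterations=20):
    """Iteratively refine hub and authority scores for every page."""
    hubs = {p: 1.0 for p in graph}
    auths = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to the page.
        for p in graph:
            auths[p] = sum(hubs[q] for q in graph if p in graph[q])
        # Hub score: sum of authority scores of pages the page links to.
        for p in graph:
            hubs[p] = sum(auths[q] for q in graph[p])
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auths.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hubs.values())) or 1.0
        auths = {p: v / a_norm for p, v in auths.items()}
        hubs = {p: v / h_norm for p, v in hubs.items()}
    return hubs, auths

hubs, auths = hits(graph)
print(max(auths, key=auths.get))  # 'authA' emerges as the strongest authority
```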

Automated categorization of documents on the Web: Automatic classification is highly desirable compared with human readers, due to its lower cost and higher speed when categorizing web documents. Conventionally, categorization algorithms use training sets of positive and negative examples and assign each document to one of a fixed set of categories. For example, the taxonomy of Rediffmail and its documents can be used as the training and test sets to realize a classifier, which can then be used to classify other web documents.

Satisfactory results can be obtained by using conventional classification techniques like Bayesian classification, decision-trees, association rules, etc. to classify web pages. The semantic information associated with hyperlinks can be used to achieve even better results than the previously mentioned classification algorithms.

However, naive use of terms in the neighbourhood of a document's hyperlinks may lead to poor accuracy due to the presence of noisy and irrelevant data in pages which are a part of back linking, e.g., advertisements, etc.

Generally, automatic classification is based only on positive and not negative data sets to avoid easy exclusion of relevant data.
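As a minimal sketch of the conventional Bayesian classification mentioned above, the following uses scikit-learn's multinomial naive Bayes over bag-of-words features; the training snippets and the two category labels are invented purely for illustration.

```python
# Minimal Bayesian web-page classifier using scikit-learn.
# Training snippets and category labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "football cricket tennis score match",
    "league goal tournament championship",
    "stocks shares market trading dividend",
    "investment portfolio bond equity",
]
train_labels = ["sports", "sports", "finance", "finance"]

classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_labels)

print(classifier.predict(["quarterly equity market report"]))  # ['finance']
```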

Mining of semantic constructs and contents: Different web pages may have different semantic structures, e.g., a department's page vs. a professor's page. First, the relevant extractable structure must be identified, using either manual methods or programmatic induction from a set of example documents. Second, this information can be used to implement automatic extraction from the underlying information database; a small extraction sketch follows. It also leads to a clearer and deeper analysis of the database contents.
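Below is a minimal wrapper-style extraction sketch under stated assumptions: the hand-written regular expressions stand in for a structure that would, in practice, be induced manually or programmatically from example pages, and the professor page is fabricated for illustration.

```python
# Minimal wrapper-style extraction: hand-written patterns stand in for a
# structure induced (manually or programmatically) from example pages.
import re

professor_page = """
<h1>Dr. A. Sharma</h1>
<p>Email: a.sharma@example.edu</p>
<p>Office: Room 214</p>
"""

PATTERNS = {
    "name":   re.compile(r"<h1>(.*?)</h1>", re.S),
    "email":  re.compile(r"Email:\s*([\w.+-]+@[\w.-]+)"),
    "office": re.compile(r"Office:\s*(.+?)</p>"),
}

def extract_record(html, patterns):
    """Pull the fields defined by the learned/authored structure out of a page."""
    record = {}
    for field, pattern in patterns.items():
        match = pattern.search(html)
        record[field] = match.group(1).strip() if match else None
    return record

print(extract_record(professor_page, PATTERNS))
# {'name': 'Dr. A. Sharma', 'email': 'a.sharma@example.edu', 'office': 'Room 214'}
```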

Mining the volatile changes in Web pages: Identification, recognition and mining of the volatile and dynamic changes in web pages can be done in three categories: content, structure and patterns of access. Storing web data in chronological order helps to detect changes easily, but the exhaustive and massive nature of the information stored on the Web makes it difficult to maintain records in an orderly fashion. We therefore resort to mining accesses to deliver better results to the user, by mining the logs or history of records on the web. Regularities in access patterns, in terms of IP address, website address and time of access, help us identify the most requested web pages, the most frequent times of access and the most frequent users; data organized in this way can then be mined for knowledge using OLAP operations. This approach can prove very beneficial for fields like business, market research and customer service support.
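The following is a minimal sketch of the access-log aggregation described above: it counts requests per page and per hour so the most requested pages and peak access times can be read off directly. The log format (IP, page, timestamp) is an illustrative assumption, and a fuller system would roll these counts up further with OLAP operations.

```python
# Minimal access-log aggregation over an assumed (ip, page, timestamp) format.
from collections import Counter
from datetime import datetime

log = [
    ("10.0.0.1", "/news",   "2017-11-02 09:15:00"),
    ("10.0.0.2", "/news",   "2017-11-02 09:40:00"),
    ("10.0.0.1", "/sports", "2017-11-02 18:05:00"),
    ("10.0.0.3", "/news",   "2017-11-02 09:55:00"),
]

page_counts = Counter()
hour_counts = Counter()
for ip, page, ts in log:
    hour = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour
    page_counts[page] += 1
    hour_counts[hour] += 1

print(page_counts.most_common(1))  # [('/news', 3)]  -> most requested page
print(hour_counts.most_common(1))  # [(9, 3)]        -> most frequent access hour
```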

Incorporating a Multidimensional Web: Just as we apply mining techniques like clustering and outlier analysis to data in data warehouses, we can apply these techniques to sets of web pages, treating them like data. Web pages are grouped into clusters if they are tightly bound in terms of content, access patterns or structure; following this, we create semantic descriptors for the analysis and record-keeping of these pages. These descriptors can be used to construct evolving, dynamic web directories that realize the data in multiple layers and dimensions. The next step is to apply abstraction to the layers, such that every layer is a further abstracted version of the preceding one while preserving the relevant, special and characterizing features. This facilitates simple yet accurate search results in less time, since a search proceeds from the highly abstracted layers down to the deeper ones.
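As a minimal sketch of the first step, clustering pages by similarity, the following groups toy page snippets using TF-IDF features and k-means from scikit-learn; a fuller system would also fold in access patterns and link structure, and the snippets and cluster count here are illustrative assumptions.

```python
# Minimal clustering sketch: group pages by textual similarity with TF-IDF
# and k-means; the snippets and cluster count are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

pages = [
    "stock market prices shares trading",
    "equity bond portfolio investment",
    "football match score tournament",
    "cricket league championship goal",
]

vectors = TfidfVectorizer().fit_transform(pages)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1]: finance pages in one cluster, sports in the other
```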

Other Methods for Mining and Efficiency Improvement: Methods such as building personalized search services and web servers based on a user's web navigation history prove to be highly accurate. This can be done by profiling the user from their history and then building the database accordingly.
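A minimal sketch of such profiling is given below: a term profile is built from a user's navigation history and candidate results are re-ranked by their overlap with it. The history, candidate titles and scoring rule are illustrative assumptions rather than a prescribed method.

```python
# Minimal personalization sketch: build a term profile from navigation history
# and re-rank candidate results by overlap with it (all data illustrative).
from collections import Counter

history = [
    "python data mining tutorial",
    "web usage mining survey",
    "data warehouse olap basics",
]

profile = Counter(word for page in history for word in page.lower().split())

def rerank(results, profile):
    """Order results by how many profile terms they share with the user."""
    def score(title):
        return sum(profile[w] for w in title.lower().split())
    return sorted(results, key=score, reverse=True)

candidates = ["jaguar car review", "web data mining overview", "stock prices today"]
print(rerank(candidates, profile))  # 'web data mining overview' ranks first
```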

Statistical Importance of Search Engine Optimization (SEO):

Marketing Sherpa reports that distribution led to a 2,000% increase in blog traffic and a 40% increase in revenue.

70-80% of users ignore the paid ads, focusing on the organic results.

75% of users never scroll past the first page of search results.

Search and e-mail are the top two internet activities.

Companies that blog have 434% more indexed pages and companies with more indexed pages get far more leads.

81% of businesses consider their blogs to be an important asset to their businesses.

A study by Outbrain shows that search is the #1 driver of traffic to content sites, beating social media by more than 300%.

79% of search engine users say they always/frequently click on the natural search results. In contrast, 80% of search engine users say they occasionally/rarely/never click on the sponsored search results.

42% of search users click the top-ranking link, 8% click the second-ranking link, and the click-through rate (CTR) continues to drop thereafter.

80% of unsuccessful searches are followed with keyword refinement.

41% of users whose searches are unsuccessful after the first page refine their keyword search phrase or change their chosen search engine.

Daily use of search engines rose from 33% in 2002 to 59% in 2005; on an average day in 2005, 60 million people used a search engine. As of March 2007, Google accounted for 64% of US searches and 77% of UK searches.

Conclusion: Data mining for improving the efficiency of search engines will prove to be an effective impetus for research in SEO technology. It will be one that facilitates exhaustive and comprehensive usage of the data available in large databases or on the web. However, it is essential that we overcome the limitations and the challenges in this field that have previously been discussed in order to make a friendlier and more efficient search engine delivering high quality results.

References: S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Proc. 7th Int'l World Wide Web Conf. (WWW7), ACM Press, New York, 1998, pp. 107-117.

J. Srivastava et al., "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data," SIGKDD Explorations, vol. 1, no. 2, 2000, pp. 12-23.

S. Chakrabarti et al., "Mining the Web’s Link Structure," Computer, Aug. 1999, pp. 60-67.

R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, Reading, Mass., 1999.


