Introduction To Knowledge Discovery

02 Nov 2017

Disclaimer:
This essay has been written and submitted by students and is not an example of our work. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of EssayCompany.

Chapter 1

INTRODUCTION

1.1. Introduction to knowledge discovery

Knowledge discovery in databases (KDD) and data mining have attracted enormous interest from researchers, industry, the scientific community, and the media for real-time applications. The rapid growth of very large databases makes it hard for human analysts to examine the data and extract useful knowledge from it, even with classic statistical (data mining) tools. The ultimate purpose of collecting this enormous amount of data is to uncover previously unidentified patterns that yield informative knowledge to support the decision-making process. KDD is an inherently interactive and iterative process.

Figure 1. Discovery of knowledge (data → information → knowledge).

Data is in the form of a string of bits, or numbers and symbols, or "objects" which are collected daily. Information is data stripped of redundancy, and reduced to the minimum necessary to characterize the data. Knowledge is integrated information, including facts and their relations, which have been perceived, discovered, or learned as our "mental pictures". Knowledge can be considered data at a high level of abstraction and generalization.

Knowledge discovery is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The step-by-step analysis of this process is as follows:

Figure 2. Knowledge discovery process.

As shown in Figure 2, decision making through knowledge discovery can only be achieved by incorporating appropriate prior knowledge and interpreting the data, rather than by applying standard data mining techniques alone. This interpretation is performed with the tools and techniques of KDD, which are of particular interest to artificial intelligence and machine learning researchers. KDD includes both description and prediction techniques for extracting useful information.

Description presents explicit information in a readable form that the user can easily understand. Prediction builds on the description of the data to forecast future knowledge. Predicting knowledge from a small database is easy, but for a large heterogeneous dataset, clustering is needed to analyze the data effectively. A good filtering technique is also needed to manage the resulting complexity; for this, the calculation of interestingness metrics is the best choice.

1.1.1. Clustering

Clustering can be used as an unsupervised learning technique (learning from raw data): it groups similar data items together into clusters. A cluster in a database can be defined as "a grouping of related items stored together for efficiency of access". Clustering can be performed even more effectively in semi-supervised or supervised settings.

Clustering algorithms can be broadly classified into four categories:

Flat clustering – This technique clusters the data without any explicit structure connecting one cluster to another; that is, there is no relation among the clusters. Although efficiency is its major strength, its limitations are the need for a predefined number of clusters, its unstructured output, and its non-deterministic nature.

Hierarchical clustering – When efficiency is not the primary concern, hierarchical clustering may be used. It overcomes the drawbacks of flat clustering: its output is structured and it is deterministic in nature. Its incremental variant reduces time complexity on dynamic datasets. However, it does not work well for clustering overlapping data objects.

Hard clustering – This technique assigns each data object to exactly one cluster. Even though the initial partitioning is random, the resulting clusters are efficient and robust, which improves performance.

Soft clustering – This is the opposite of hard clustering: a data object may belong to more than one cluster.
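To make the flat/hard distinction concrete, here is a minimal pure-Python sketch of k-means, a typical flat, hard clustering algorithm. The one-dimensional data, the value of k, and the iteration count are illustrative assumptions, not taken from the text:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal flat, hard k-means sketch: every point is assigned to
    exactly one cluster (hard), and the clusters carry no structure
    relating them to each other (flat)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Hard assignment: each point goes to its single nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[idx].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centroids, clusters = kmeans(data, k=2)
```

Note the limitations mentioned above: k must be given in advance, and a different random seed can yield a different partition.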

Clusters can be validated by their quality and by their behavior on dynamic (scalable) databases.

Quality - Cluster quality depends on internal and external measures. Intra-cluster tightness and inter-cluster separation are internal quality measures, whereas counting the data points that fall into the wrong cluster is an external measure.

Scalability - Another factor affecting clustering is the need for periodic updates to the data during the maintenance phase, which can be handled with the incremental clustering concept.
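The two internal quality measures can be sketched in a few lines; the clusters and distance function (squared Euclidean in one dimension) are illustrative assumptions:

```python
def intra_cluster_tightness(cluster, centroid):
    # Average squared distance of members to their centroid (lower = tighter).
    return sum((p - centroid) ** 2 for p in cluster) / len(cluster)

def inter_cluster_separation(c1, c2):
    # Squared distance between two centroids (higher = better separated).
    return (c1 - c2) ** 2

a, b = [1.0, 1.2, 0.8], [10.0, 10.5, 9.5]
ca, cb = sum(a) / len(a), sum(b) / len(b)
tight = intra_cluster_tightness(a, ca)   # small for a compact cluster
sep = inter_cluster_separation(ca, cb)   # large for well-separated clusters
```

A good clustering scores low on tightness within each cluster and high on separation between clusters.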

1.1.2. Interestingness metrics:

It is well known that the data mining process can generate many hundreds, often thousands, of patterns from data. The data miner's task then becomes one of determining the most useful patterns and separating them from those that are trivial or already well known to the organization. It is therefore necessary to filter the patterns using some measure of their actual worth. Interestingness metrics are considered the best solution for evaluating and ranking the discovered patterns.

Figure 3. Techniques for knowledge discovery. In Figure 3.a, all patterns produced by the data mining process are passed to the user. In Figure 3.b, the search for interesting patterns occurs as a post-processing step: an interestingness filter is placed between the data mining step and the user. The preferred method, Figure 3.c, integrates the search for interesting patterns within the data mining algorithm itself.

The approach in Figure 3.a is a simple data dredging approach, cumbersome and time consuming because every generated pattern is presented to the user. In Figure 3.b, a filter is applied as a post-processor to extract the interesting patterns according to some criteria or guidelines. Figure 3.c presents a much more intelligent scheme in which the data mining algorithms work in conjunction with a pattern assessment element, giving a more dynamic approach to knowledge discovery. This typically involves an element of feedback as the search for interesting patterns proceeds. Here lies the challenge for the data mining community: developing interestingness measures and integrating the user within the system.

The step-by-step procedure of Figure 3.c, finding interesting patterns alongside the data mining algorithms, is shown in Figure 4. Once the mined patterns are produced by the data mining algorithms, ranking and filtering are carried out simultaneously using the already calculated interestingness metrics. The final output is the filtered set of interesting patterns.

Figure 4. Discovery of interesting patterns by interestingness metrics (data → data mining → mined patterns → ranking and filtering by interestingness measures → interesting patterns).
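The ranking-and-filtering stage of Figure 4 can be sketched as follows. The mined patterns, their precomputed lift values, and the choice of lift as the interestingness score are hypothetical examples, not from the text:

```python
def rank_and_filter(patterns, score, threshold):
    """Sketch of the Figure 4 pipeline: score each mined pattern with an
    interestingness measure, rank by score, and keep those above a threshold."""
    scored = [(p, score(p)) for p in patterns]
    scored.sort(key=lambda ps: ps[1], reverse=True)     # ranking
    return [p for p, s in scored if s >= threshold]     # filtering

# Hypothetical mined association rules with a precomputed lift measure.
mined = [
    {"rule": "bread -> butter", "lift": 2.4},
    {"rule": "milk -> bread", "lift": 1.1},
    {"rule": "eggs -> soap", "lift": 0.6},
]
interesting = rank_and_filter(mined, score=lambda p: p["lift"], threshold=1.0)
```

Only the patterns scoring at or above the threshold survive, already sorted from most to least interesting.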

Interestingness is perhaps best treated as a broad concept that emphasizes conciseness, coverage, reliability, peculiarity, diversity, novelty, surprisingness, utility, and actionability. These nine specific criteria are used to determine whether or not a pattern is interesting. They are described as follows.

Conciseness: A pattern is concise if it contains relatively few attribute-value pairs, while a set of patterns is concise if it contains relatively few patterns. A concise pattern or set of patterns is relatively easy to understand and remember and thus is added more easily to the user’s knowledge.

Generality/Coverage: A pattern is general if it covers a relatively large subset of a dataset. Generality measures the comprehensiveness of a pattern, that is, the fraction of all records in the dataset that matches the pattern. An itemset is frequent if its support, the fraction of records in the dataset containing the itemset, is above a given threshold. Generality frequently coincides with conciseness because concise patterns tend to have greater coverage.

Reliability: A pattern is reliable if the relationship described by the pattern occurs in a high percentage of applicable cases. For example, a classification rule is reliable if its predictions are highly accurate, and an association rule is reliable if it has high confidence.
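The support measure mentioned under generality and the confidence measure mentioned under reliability can be computed directly; the transaction data below is a hypothetical example:

```python
def support(itemset, transactions):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # Reliability of the rule lhs -> rhs: support of both sides together,
    # divided by support of the left-hand side (an estimate of P(rhs | lhs)).
    return support(lhs | rhs, transactions) / support(lhs, transactions)

txns = [{"bread", "butter"},
        {"bread", "butter", "milk"},
        {"bread"},
        {"milk"}]
s = support({"bread", "butter"}, txns)        # 2 of 4 transactions
c = confidence({"bread"}, {"butter"}, txns)   # 2 of the 3 bread transactions
```

An itemset is frequent when its support exceeds a chosen threshold, and a rule is reliable when its confidence is high.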

Peculiarity: A pattern is peculiar if it is far away (outliers) from other discovered patterns according to some distance measure. Peculiar patterns may be unknown to the user, hence interesting.

Diversity: A pattern is diverse if its elements differ significantly from each other, while a set of patterns is diverse if the patterns in the set differ significantly from each other. According to a simple point of view, a summary can be considered diverse if its probability distribution is far from the uniform distribution.

Novelty: A pattern is novel to a person if he or she did not know it before and cannot infer it from other known patterns. No known data mining system represents everything that a user knows, and thus novelty cannot be measured explicitly with reference to the user's knowledge. Similarly, no known data mining system represents what the user does not know, and therefore novelty cannot be measured explicitly with reference to the user's ignorance. Instead, novelty is detected by having the user either explicitly identify a pattern as novel or notice that a pattern cannot be deduced from, and does not contradict, previously discovered patterns.

Surprisingness: A pattern is surprising (or unexpected) if it contradicts a person’s existing knowledge or expectations. A pattern that is an exception to a more general pattern which has already been discovered can also be considered surprising. Surprising patterns are interesting because they identify failings in previous knowledge and may suggest an aspect of the data that needs further study. The difference between surprisingness and novelty is that a novel pattern is new and not contradicted by any pattern already known to the user, while a surprising pattern contradicts the user’s previous knowledge or expectations.

Utility: A pattern is of utility if its use by a person contributes to reaching a goal. Different people may have divergent goals concerning the knowledge that can be extracted from a dataset.

Actionability/Applicability: A pattern is actionable (or applicable) in some domain if it enables decision making about future actions in this domain. Actionability is sometimes associated with a pattern selection strategy. So far, no general method for measuring actionability has been devised.

These nine criteria can be further categorized into three classifications:

Objective:

An objective measure is based only on the raw data. No knowledge about the user or application is required. Most objective measures are based on theories in probability, statistics, or information theory. Conciseness, generality, reliability, peculiarity, and diversity depend only on the data and patterns, and thus can be considered objective.

Subjective:

A subjective measure takes into account both the data and the user of these data. To define a subjective measure, access to the user’s domain or background knowledge about the data is required. This access can be obtained by interacting with the user during the data mining process or by explicitly representing the user’s knowledge or expectations. Novelty and surprisingness depend on the user of the patterns, as well as the data and patterns themselves, and hence can be considered subjective.

Semantic:

A semantic measure considers the semantics and explanations of the patterns. Because semantic measures involve domain knowledge from the user, some researchers consider them a special type of subjective measure. Utility and actionability depend on the semantics of the data, and thus can be considered semantic. Unlike subjective measures, where the domain knowledge is about the data itself and is usually represented in a format similar to that of the discovered pattern, the domain knowledge required for semantic measures does not relate to the user's knowledge or expectations concerning the data. Instead, it represents a utility function that reflects the user's goals. This function should be optimized in the mined results.

1.2. Project Description:

Knowledge can be discovered from databases through prediction, usually performed with the "Decision-Tree Methodology". In this project, "Knowledge Discovery using Multi-Source Combined Mining approach", an "Adaptive prediction method" is implemented in which rules are extracted from multiple data sources as well as from different data servers. The rules are formed from each data source serially, and the final extraction is performed in parallel.

The pattern rules are formed in a two-step manner:

Cluster formation

Interesting patterns calculation.

A Multi-Method clustering approach is used for cluster formation: clustering is done incrementally by integrating "Hard-Flat clustering" with "Hierarchical clustering", and the results of plain Hard-Flat clustering are compared against the proposed Incremental clustering. Hard-Flat clustering is well known for its simplicity and efficiency, whereas Hierarchical clustering is known for its flexibility on dynamic datasets. The proposed "Incremental clustering" reduces time complexity and thus best suits dynamic datasets, since re-clustering of the data can be avoided.
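One minimal way an incremental scheme can avoid re-clustering is to fold each new point into the nearest existing cluster, or open a new cluster when the point is too far from all of them. This sketch is an assumed illustration of that idea, not the project's algorithm; in particular, the `radius` tuning parameter and the one-dimensional data are hypothetical:

```python
def incremental_add(point, centroids, counts, radius):
    """Fold one new point into an existing clustering without re-clustering:
    update the nearest centroid in place, or seed a new cluster if the point
    is farther than `radius` (an assumed tuning parameter) from all centroids."""
    if centroids:
        i = min(range(len(centroids)), key=lambda j: abs(point - centroids[j]))
        if abs(point - centroids[i]) <= radius:
            counts[i] += 1
            # Running-mean update keeps the centroid exact without revisiting
            # previously clustered points.
            centroids[i] += (point - centroids[i]) / counts[i]
            return i
    centroids.append(point)
    counts.append(1)
    return len(centroids) - 1

cents, ns = [], []
for x in [1.0, 1.2, 10.0, 0.8, 10.4]:
    incremental_add(x, cents, ns, radius=3.0)
```

Each arriving point costs only one pass over the current centroids, which is what makes the incremental approach attractive for dynamic datasets.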

The interestingness of the patterns is also calculated, and the results are filtered by setting a threshold value. In addition to the commonly used interestingness metrics such as support, confidence, and lift, this adaptive prediction method includes contribution and interestingness measures to improve the quality of the pattern rules: the higher the value, the more interesting the rule. A Multi-Featured approach is also included in pattern formation, where a feature represents the type of data; numerical and categorical data types are considered in forming the rules, and numerical attributes are converted into categorical attributes for the metrics calculation.

Thus, in this "Adaptive prediction", the data are processed and the resulting information is formed into patterns by means of the clustering technique and the interestingness metric calculation. Finally, the prediction is performed in parallel using the patterns extracted from the different data sources.
