Knowledge Discovery In Databases


CHAPTER 1

INTRODUCTION

Data mining has attracted a great deal of attention in the information industry and in society, owing to the wide availability of huge amounts of data and the need to turn such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from market analysis, fraud detection and customer retention to production control and science exploration [25].

Fayyad et al. [47] defined data mining as "a process of discovering valuable information from large amounts of data stored in databases, data warehouses, or other information repositories".

This valuable information can take the form of patterns, associations, changes, anomalies and significant structures [47, 49]. That is, data mining attempts to extract potentially useful knowledge from data.

Data mining develops techniques and tools that assist experienced and inexperienced decision makers in analyzing and processing data for application purposes. At the same time, the pressure to enhance corporate profitability has caused companies to spend more time identifying diverse opportunities such as sales and investments. To this end, huge amounts of data are collected in their databases for decision-support purposes.

Data Mining involves an integration of techniques from multiple disciplines such as database and data warehouse technology, statistics, machine learning, high performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing and spatial or temporal data analysis.

Data mining might more appropriately have been named "knowledge mining from data" or "knowledge discovery from databases".

KNOWLEDGE DISCOVERY IN DATABASES (KDD)

Data mining has been popularly treated as a synonym for knowledge discovery in databases, although some researchers view data mining as an essential step in the knowledge discovery process.

The emergence of data mining and knowledge discovery in databases as a new technology has occurred because of the fast development and wide application of information and database technologies. Data mining and KDD are aimed at developing methodologies and tools which can automate the data analysis process and create useful information and knowledge from data to help in decision making. A widely accepted definition is given by Fayyad et al. [47], in which KDD is defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. This definition points to KDD as a complicated process comprising a number of steps. Data mining is one step in the process.

The scope of data mining and KDD is very broad and touches a great many fields of study related to data analysis; statistical research, for example, has focused on this area for over a century. Data mining and KDD draw upon methods, algorithms and technologies from these diverse fields, with the common goal of extracting knowledge from data.

Processing Steps of KDD

The process of knowledge discovery in databases consists of an iterative sequence of the following steps [12, 15, 16, 40, 52]:

Defining the problem: The goals of the knowledge discovery project must be identified and verified as actionable. Generally, if the goals are met, a business can put the newly discovered knowledge to use, and the same holds in other applications.

Data preprocessing: Includes data collection, data cleaning, data selection, and data transformation.

Data collection: Obtaining necessary data from various internal and external sources, resolving representation and encoding differences, joining data from various tables to create a homogeneous source.

Data cleaning: Checking and resolving data conflicts, outliers (unusual or exceptional values), noisy or erroneous values, missing data, and ambiguity, and using conversions and combinations to generate new data fields such as ratios or rolled-up summaries. These steps require considerable effort in the process of knowledge discovery.

Data selection: Data relevant to an analysis task is selected from a given database. In other words, a data set is selected, or else attention is focused on a subset of variables or data samples, on which knowledge extraction is to be performed.

Data transformation: Data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
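
As a minimal illustration of these preprocessing steps, the following Python sketch (using pandas; the columns age, income and region are hypothetical and not drawn from any dataset discussed here) collects, cleans, selects and transforms a small table:

```python
import pandas as pd

# Hypothetical raw data as collected from different sources (column names are assumptions).
raw = pd.DataFrame({
    "age":    [25, 41, None, 37, 230],        # contains a missing value and an outlier
    "income": [32000, 54000, 41000, None, 61000],
    "region": ["north", "North", "south", "south", "east"],
})

# Data cleaning: resolve encoding differences, treat impossible values, fill gaps.
raw["region"] = raw["region"].str.lower()
raw.loc[raw["age"] > 120, "age"] = None              # implausible ages become missing
raw["age"] = raw["age"].fillna(raw["age"].median())
raw["income"] = raw["income"].fillna(raw["income"].median())

# Data selection: keep only the attributes relevant to the analysis task.
selected = raw[["age", "income"]]

# Data transformation: rescale into a form suitable for mining.
transformed = (selected - selected.mean()) / selected.std()
print(transformed)
```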

Data mining: An essential process, where intelligent methods are applied in order to extract data patterns. Patterns of interest in a particular representational form, or a set of such representations are searched for, including classification rules, regression analysis, clustering, sequence modeling, dependency, and so forth. The user can significantly aid the data mining method by correctly performing the preceding steps.

Post data mining: Includes pattern evaluation, deploying the model, maintenance, and knowledge presentation. Pattern evaluation identifies the truly interesting patterns representing knowledge, based on available interestingness measures, and tests the model for accuracy on an independent dataset that was not used to create it.

Deploying the model. The model is used to predict results for new cases, and the predictions are then used to alter organizational behavior.

Maintaining. Maintaining models requires constant revalidation of the model, with new data, to assess if it is still appropriate for the organization.

Knowledge presentation. Visualization and knowledge representation techniques are used to present mined knowledge to users.

Feature Selection

The other important requirement concerning the KDD process is ‘Feature Selection’ [40, 52]. KDD is a complicated task and usually depends on correct selection of features. Feature selection is the process of choosing features which are necessary and sufficient to represent the data. There are several issues influencing feature selection, such as masking variables, the number of variables employed in the analysis and relevancy of the variables.

Masking variables are variables that hide or disguise patterns in the data. Numerous studies have shown that the inclusion of irrelevant variables can obscure the real structure of the data, so only relevant variables should be included in the analysis for the data mining task at hand.

The number of variables used in data mining is also an important consideration. There is generally a tendency to use more variables than perhaps necessary. However, increased dimensionality has an adverse effect because, for a fixed number of data patterns, it makes the multi-dimensional data space sparse.

Prior knowledge should be used if it is available, and mathematical approaches need to be employed for better results. Principal component analysis, a useful tool in data mining, is also very useful for reducing dimensionality; however, it is only suitable for real-valued attributes. Mining association rules is an effective approach for identifying links between variables which take only categorical values. Sensitivity studies using feed-forward neural networks are also an effective way of identifying important and less important variables. Jain, Murty and Flynn [1] have reviewed a number of clustering techniques which identify discriminating variables in data.
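
As a hedged illustration of dimensionality reduction with principal component analysis, the following sketch uses scikit-learn on synthetic real-valued data (all numbers are assumptions made for the example) and projects ten correlated attributes onto three components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic real-valued data: 200 patterns described by 10 correlated attributes,
# generated from 3 underlying factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

# Project onto the principal components that explain most of the variance,
# reducing the dimensionality from 10 to 3 before further mining.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # contribution of each retained component
```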

APPLICATIONS OF KNOWLEDGE DISCOVERY IN DATABASES

Data mining and KDD are potentially valuable in virtually any industrial or business sector where database and information technology are used. Below are some of the applications of data mining.

Fraud detection: identifying fraudulent transactions.

Loan approval: establishing credit worthiness of a customer requesting a loan.

Investment analysis: predicting a portfolio’s return on investment.

Portfolio trading: trading a portfolio of financial instruments by maximizing returns and minimizing risks.

Marketing and sales data analysis: identifying potential customers; establishing the effectiveness of a sale campaign.

Manufacturing process analysis: identifying the causes of manufacturing problems.

Experiment result analysis: summarizing experiment results and predictive models.

Scientific data analysis.

Intelligent agents and WWW navigation.

DATA MINING TASKS

A data mining system may accomplish one or more of the following data mining tasks.

Class description. Class description provides a concise summarization of a collection of data and distinguishes it from other data. The summarization of a collection of data is known as "Class Characterization", whereas the comparison between two or more collections of data is called "Class Comparison" or "Discrimination". Class description should also cover summary properties of data dispersion, such as variance and quartiles.

Association. Association is the discovery of association relationships or correlations among a set of items. These are often expressed in rule form, showing attribute-value conditions that occur frequently together in a given set of data. An association rule of the form X → Y is interpreted as ‘database tuples that satisfy X are likely to satisfy Y’. Association analysis is widely used in transaction data analysis for direct marketing, catalog design, and other business decision-making processes. Substantial research has been performed on association analysis with different forms of data, such as categorical, interval and meta-pattern data. Efficient algorithms have been proposed in the field of association rule mining for discovering rules from multidimensional datasets, for constraint-based rule mining and for level-wise rule mining.

Classification. Classification analyzes a set of training data (i.e., a set of objects whose class label is known) and constructs a model for each class, based on the features in the data. A decision tree, or a set of classification rules, is generated by such a classification process and can be used for better understanding of each class in the database and for classification of future data. Many classification methods have been developed in fields such as machine learning, statistics, databases, neural networks and rough sets. Classification has been used in customer segmentation, business modeling, and credit analysis.

Prediction. This mining function predicts the possible values of certain missing data, or the value distribution of certain attributes in a set of objects. Regression analysis, generalized linear models, correlation analysis, decision trees, genetic algorithms and neural network models have been useful tools for prediction.

Clustering. Clustering analysis identifies clusters embedded in the data, where a cluster is a collection of data objects that are "similar" to one another. Similarity can be expressed by distance functions, specified by users or experts. A good clustering method ensures that the inter-cluster similarity is low and the intra-cluster similarity is high.

Time-series analysis. Time-series analysis analyzes large sets of time-series data to determine regularities and interesting characteristics. This includes searching for similar sequences or subsequences, and mining sequential patterns, periodicities, trends and deviations.

DATA MINING TECHNIQUES

Data mining methods and tools can be categorized in different ways [25, 46 and 47]. They can be classified as clustering, classification, dependency modeling, summarization, regression, case based learning, and mining time-series data, depending on functions and application purposes. Some methods are traditional and established, while some are relatively new.

In general, data mining techniques can be classified into two categories: descriptive data mining and predictive data mining. The former describes the data set in a concise and summary manner and presents interesting general properties of the data, whereas the latter constructs one, or a set of, models, performs inference on the available set of data, and attempts to predict the behavior of new data sets [15, 16, 25, 40, 46 and 48].

Clustering

Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). Clustering aims to devise a classification scheme for grouping the objects into a number of classes such that instances within a class are similar, but distinct from those in other classes. Grouping multivariate data into clusters according to similarity or dissimilarity measures is the goal of many applications.
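
A minimal sketch of clustering, assuming synthetic data and the k-means algorithm from scikit-learn (one of many possible clustering methods, chosen here only for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic multivariate data with three natural groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k-means groups the objects so that points within a cluster are close
# (high intra-cluster similarity) and the clusters are well separated.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels[:10])
print(kmeans.cluster_centers_)
```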

Classification

The task of classification is to assign unknown data patterns to the established classes. The most widely used classification approach is based on feed-forward neural networks. Classification is also known as supervised machine learning because it always requires data patterns with known class assignments to train a model. This model is then used for predicting the class assignment of new data patterns [34].

Decision Tree Based Classification

A decision tree is a model that is both predictive and descriptive. The visual presentation makes a decision tree model very easy to understand and assimilate. The decision tree has become a very popular data mining technique. Decision trees are most commonly used for classification (i.e., for predicting what group a case belongs to), but can also be used for regression (predicting a specific value).
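
A minimal sketch of decision tree classification using scikit-learn on the standard iris dataset (the dataset and parameter choices are assumptions made for illustration); it shows both the predictive use (accuracy on unseen cases) and the descriptive use (the tree printed as readable rules):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a small tree; limiting the depth keeps the model easy to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # predictive: accuracy on held-out cases
print(export_text(tree))            # descriptive: the tree as readable rules
```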

Naive-Bayes Based Classification

Naive-Bayes is a classification technique that is both predictive and descriptive. It analyzes the relationship between each independent variable and the dependent variable to derive a conditional probability for each relationship. When a new case is analyzed, a prediction is made by combining the effects of the independent variables on the dependent variable (the outcome that is predicted).
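
A minimal sketch of Naive-Bayes classification, assuming Gaussian class-conditional distributions and a standard scikit-learn dataset (both choices are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each feature contributes an independent conditional probability; a prediction
# for a new case combines the effects of all independent variables.
nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.predict(X_test[:5]))         # class predictions for five new cases
print(nb.predict_proba(X_test[:5]))   # combined conditional probabilities
```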

Nearest Neighbor Based Classification

Nearest Neighbor (more precisely k-nearest neighbor, also k-NN) is a predictive technique suitable for classification models. Unlike other predictive algorithms, the training data is not scanned or processed to create the model. Instead, the training data is the model.

As the term ‘nearest’ implies, k-NN is based on a concept of distance. This requires a metric to determine distances. All metrics must result in a specific number for the purpose of comparison. Whatever metric is used, it is both arbitrary and extremely important. It is arbitrary because there is no preset definition of what constitutes a ‘good’ metric. It is important because the choice of a metric greatly affects the predictions. Different metrics, used on the same training data, can result in completely different predictions. This means that a business expert needs to determine a good metric.
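
To illustrate how strongly the choice of metric matters, the following sketch (scikit-learn, standard wine dataset; all choices are illustrative assumptions) trains k-NN with three different distance metrics on the same training data:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)   # distances are meaningless on raw scales
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The training data itself is the "model"; only the distance metric changes,
# yet it can change which neighbors are "nearest" and hence the predictions.
for metric in ("euclidean", "manhattan", "chebyshev"):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train, y_train)
    print(metric, knn.score(X_test, y_test))
```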

Neural Networks Based Classification

Neural networks are based on an early model of human brain function. Although they are described as ‘networks’, a neural net is a mathematical function that computes an output based on a set of input values. The network paradigm makes it easy to decompose the larger function to a set of related sub-functions, and it enables a variety of learning algorithms that can estimate the parameters of the sub-functions. The output from a neural network is purely predictive, because there is no descriptive component to a neural network model. Neural nets are used in applications such as handwriting recognition or robot control.

Neural nets operate only on numbers. As a result, any non-numeric data in either the independent (input) or dependent (output) columns must be converted to numbers before the data can be used with a neural net.
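
A minimal sketch of this conversion, assuming a hypothetical table with a non-numeric region column (the column names, values and network size are assumptions), using one-hot encoding before training a small feed-forward network:

```python
import pandas as pd
from sklearn.neural_network import MLPClassifier

# Hypothetical data with a non-numeric input column (names and values are assumed).
df = pd.DataFrame({
    "region": ["north", "south", "east", "north", "south", "east"] * 20,
    "income": [30, 60, 45, 52, 38, 70] * 20,
    "bought": [0, 1, 0, 1, 0, 1] * 20,
})

# Non-numeric inputs must be turned into numbers before a neural net can use them;
# one-hot encoding replaces 'region' with 0/1 indicator columns.
X = pd.get_dummies(df[["region", "income"]], columns=["region"]).astype(float)
y = df["bought"]

# A small feed-forward network; its output is purely predictive.
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X, y)
print(net.predict(X.head()))
```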

Conceptual Clustering and Classification

Conceptual clustering and classification develops a qualitative language for describing the knowledge used for clustering. The knowledge is basically in the form of production rules or decision trees, which are explicit and transparent. The inductive system C5.0 (previously C4.5) is a typical approach. It is able to automatically generate decision trees and production rules from databases. Decision trees and rules have a simple representative form, making the inferred model relatively easy for the user to comprehend.

Dependency Modeling

Dependency modeling describes dependencies among variables. Dependency models exist at two levels: structural and quantitative. The structural level of the model specifies (often in graphical form) which variables are locally dependent. The quantitative level specifies the strengths of the dependencies, using some numerical scale. Dependency modeling approaches include probabilistic (Bayesian) graphs and fuzzy digraphs.

Probabilistic graphical models are very powerful representation schemes which allow for fairly efficient inference and for probabilistic reasoning. Other dependency modeling approaches include statistical analysis like correlation coefficients, principal component and factor analysis, and sensitivity analysis using neural networks.

Summarization

Summarization provides a compact description for a subset of data. Simple examples would be the mean and standard deviation. More sophisticated functions involve summary rules, multivariate visualization techniques, and functional relationships between variables. A notable technique for summarization is that of mining association rules. Given a database, association rule mining techniques find all associations of the form:

IF {set of items} THEN {set of items}

If A and B are sets of items in a transaction set, the above rule can be represented as A → B. Rule support and confidence are two measures of rule interestingness; they reflect the usefulness and certainty of a discovered rule. Support is the percentage of transactions that contain both A and B, and confidence is the percentage of transactions containing A that also contain B. Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong rules. Such thresholds can be set by users or domain experts.
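
A minimal sketch of computing these two measures for a rule A → B over a small set of hypothetical transactions (the items and baskets are invented for illustration):

```python
# Hypothetical transactions; each is the set of items in one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

A, B = {"bread"}, {"milk"}
print(support(A | B))      # support of the rule A -> B
print(confidence(A, B))    # confidence of the rule A -> B
```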

Regression

Linear (or non-linear) regression is one of the most common approaches used for correlating data. Statistical regression methods often require the user to specify the function over which the data is to be fitted. In order to specify the function, it is necessary to know the forms of the equations governing the correlation for the data. The advantage of such methods is that it is possible to gain from the equation, some qualitative knowledge about input-output relationships.
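
A minimal sketch of linear regression on synthetic data (the coefficients and noise level are assumptions), showing how the fitted equation yields qualitative knowledge about the input-output relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the output is a noisy linear function of two inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Fitting the specified (linear) functional form gives interpretable coefficients:
# each coefficient describes how one input relates to the output.
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)
```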

Case-Based Learning

Case-based learning is based on acquiring knowledge represented by cases. It employs reasoning by analogy. Case-based learning focuses on the indexing and retrieval of relevant precedents. Case-based reasoning is particularly useful for utilizing data which has complex internal structures. Differing from other data mining techniques, it does not require a large number of historical data patterns.

Mining Time-Series Data

Many industries and businesses deal with time-series or dynamic data. It is apparent that all statistical and real-time control data used in process monitoring and control is essentially time-series. Time-series data can be dealt with by preprocessing the data so that a minimum number of data points captures the essential features while noise is removed.

DATA MINING AND MARKETING

Data mining has become widely recognized as a critical field by companies of all types. The use of valuable information ‘mined’ from data is recognized as necessary to maintain competitiveness in today’s business environments. With large amounts of data readily available and increasing computing power, industry is now looking for technology and tools to extract usable information from detailed data.

Data mining has received the most publicity and success in the fields of database marketing and credit-card fraud detection. In database marketing, great accomplishments have been achieved in the following areas.

Response modeling: predicts which prospects are likely to buy, based on previous purchase history, demographic, geographic, and lifestyle data.

Cross-selling: maximizes sales of products and services to a company’s existing customer base by studying the purchase patterns of products frequently purchased together.

Customer valuation: predicts the value or profitability of a customer over a specified period of time based on previous purchase history, demographic, geographic, and lifestyle data.

Segmentation and profiling: improves understanding of a customer segment through data analysis and profiling of prototypical customers.

SOLVING REAL-WORLD PROBLEMS BY DATA MINING

One of the most popular and successful applications of database systems is in the area of marketing where a great deal of information about customer behavior is collected. Marketers are interested in finding customer preferences so as to target them in their future campaigns [30, 46].

The following brief description of several existing knowledge-discovery systems exemplifies the nature of the problems being tackled and helps to visualize the main design issues arising therein.

The SKICAT (Sky Image Cataloging and Analysis Tool)

This system automates the reduction and analysis of large astronomical datasets. SKICAT integrates techniques for image processing, data classification, and database management. The goal of SKICAT is to classify sky objects which are too faint to be recognized by astronomers.

Health-KEFIR (Key Findings Reporter) is a knowledge discovery system used in health-care as an early warning system [46]. The system concentrates on ranking deviations according to measures of how interesting these events are to the user. The system performs an automatic drill-down through data along multiple dimensions to determine the most interesting deviations relative to previous and expected values.

TASA (Telecommunication Network Alarm Sequence Analyzer) was developed for predicting faults in a communication network [46]. A typical network generates hundreds of alarms per day. TASA generates rules like ‘if a certain combination of alarms occurs within a given time period, then an alarm of another type will occur within a following time period’. The time periods for the ‘if’ part of the rules are selected by the user, who can rank or group the rules once TASA has generated them.

The R-MINI system uses both deviation detection and classification techniques to extract useful information from noisy domains [46]. It uses logic to generate a minimal-size rule set that is both complete and consistent.

Knowledge Discovery Workbench (KDW). This is a collection of methods used for interactive analysis of large business databases. It includes many different methods for clustering, classification, deviation detection, summarization, dependency analysis, etc. The user guides the system in carrying out the searches.

TRENDS OF DATA MINING

Data mining tools operate on data, so we can expect to see algorithms move closer to the data. The major advantage that data mining tools have over traditional analysis tools is that they use computer cycles to replace human cycles. The market will continue to build on that advantage with products that search larger and larger spaces to find the best model. This will occur in products that incorporate different modeling techniques in the search. It will also contribute to ways of automatically creating new variables, such as ratios or rollups.

There have been many data mining systems developed in recent years. This trend of research and development is expected to continue to flourish because of the huge amounts of data that have been collected in databases and the necessity to understand, research and make good use of such data in decision making. This serves as the driving force behind data mining.

The diversity of data, data mining tasks, and data mining approaches pose many challenging research issues. Important tasks presenting themselves for data mining researchers and data mining system and application developers are listed below [25]:

establishing a powerful representation for patterns in data

designing data mining languages

developing efficient and effective data mining methods and systems

exploring efficient techniques for mining multi-databases, small databases, and other special databases

constructing interactive and integrated data mining environments, and

applying data mining techniques to solve large application problems.

When a large amount of inter-related data is effectively analyzed from different perspectives, it can pose threats to the goal of protecting data security and guarding against the invasion of privacy. It is a challenging task to develop effective techniques for preventing the disclosure of sensitive information in data mining. This is especially true when the data mining system is used in a vast array of areas such as financial data analysis, the retail industry, the telecommunications industry, biological data analysis and intrusion detection.

FREQUENT PATTERNS AND ASSOCIATION RULE MINING

Frequent patterns are those that appear in a dataset frequently. A set of items that appear frequently together in a transaction dataset is called a frequent itemset. A subsequence that occurs frequently in a shopping-history database is a frequent sequential pattern. A substructure, such as a subgraph, subtree or sublattice, that occurs frequently is called a frequent structured pattern. Finding such frequent patterns plays an essential role in mining associations and correlations, and frequent pattern mining has therefore become an important data mining task and a focused theme in data mining research.

Frequent itemset mining leads to the discovery of association rules among items in large transactional dataset. The discovery of interesting associations or relationships among huge amounts of business transactions can help in many business decision-making processes, such as catalog design, cross-marketing and customer shopping behavior analysis.

An example of frequent itemset mining is market-basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets". The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. Market basket analysis may also help to design different store layouts. In one strategy, items that are frequently purchased together can be placed in proximity in order to further encourage the sale of such items together. The results from this analysis can be used to plan marketing or advertising strategies or in the design of a new catalog.

If the universal set of items available at a store is known, then each item can be given a Boolean variable representing its presence or absence. Each basket can then be represented by a Boolean vector of values assigned to these variables. The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or purchased together. These patterns can be represented in the form of association rules; an association rule has the form X → Y, where X and Y are itemsets.

A data mining system has the potential to generate thousands or millions of patterns or rules, of which only a small fraction would be of interest to any given user. A pattern can be interesting if it is [25]:

Easily understood by humans

Valid on new or test data with some degree of certainty

Potentially useful and novel.

Able to validate a hypothesis that the user sought to confirm.

In the discovery of association rules, interestingness is an important feature for retrieving quality association rules. There are two types of interestingness measures for association rules: objective measures and subjective measures. An objective interestingness measure is based on the structure of discovered patterns and the statistics underlying them. One objective measure for an association rule of the form X → Y is support, the percentage of transactions that contain both X and Y; this is taken to be the probability P(X ∪ Y). Another objective measure is confidence, the percentage of transactions containing X that also contain Y; this is taken to be the conditional probability P(Y|X). Together they reflect the usefulness and degree of certainty of the discovered association rule. Association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts [25].

The process of mining association rules consists of two parts. First, all the itemsets in the data that are adequate for mining association rules have to be identified. These combinations have to show at least a certain frequency to be worth mining and are thus called frequent itemsets. The second step generates rules from the discovered frequent itemsets [25].

Mining Frequent Patterns

Mining frequent patterns from a given dataset is not a trivial task. All sets of items that occur at least as frequently as a user-specified minimum support have to be identified at this step. An important issue is the computation time, because for large databases there can be a great many possible itemsets, all of which need to be evaluated. Different algorithms allow frequent patterns to be discovered efficiently.
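
For illustration only, the following brute-force sketch enumerates candidate itemsets and keeps those meeting a user-specified minimum support; it is not a specific published algorithm, and real algorithms such as Apriori or FP-growth prune this search far more efficiently. The transactions and threshold are invented for the example:

```python
from itertools import combinations

# Hypothetical transaction dataset and a user-specified minimum support.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
min_sup = 0.6

items = sorted(set().union(*transactions))
frequent = {}
# Enumerate candidate itemsets of every size and keep the frequent ones.
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        candidate = frozenset(candidate)
        sup = sum(candidate <= t for t in transactions) / len(transactions)
        if sup >= min_sup:
            frequent[candidate] = sup

for itemset, sup in frequent.items():
    print(set(itemset), round(sup, 2))
```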

Discovering Association Rules

After generating all patterns that meet the user-specified minimum support, rules can be generated from them. For this, a minimum confidence has to be defined. The task is to generate all possible rules from the frequent itemsets and then compare their confidence values with the minimum confidence. All rules that meet this requirement are considered interesting. Frequent itemsets that do not yield any interesting, quality rules do not have to be considered further.
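
Continuing in the same spirit, the following self-contained sketch generates rules from a set of frequent itemsets (here hard-coded with supports consistent with the previous example) and keeps only those meeting a minimum confidence; the threshold is an assumption:

```python
from itertools import combinations

# Frequent itemsets and their supports (e.g. as produced by the previous sketch).
frequent = {
    frozenset({"bread"}): 0.8,
    frozenset({"milk"}): 0.8,
    frozenset({"butter"}): 0.8,
    frozenset({"bread", "milk"}): 0.6,
    frozenset({"bread", "butter"}): 0.6,
    frozenset({"milk", "butter"}): 0.6,
}
min_conf = 0.7   # user-specified minimum confidence

# For every frequent itemset, split it into antecedent -> consequent and keep
# only the rules whose confidence meets the threshold.
for itemset, sup in frequent.items():
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for antecedent in combinations(itemset, k):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            conf = sup / frequent[antecedent]
            if conf >= min_conf:
                print(set(antecedent), "->", set(consequent), "conf =", round(conf, 2))
```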

Han et al. [21] note that "mining frequent itemsets from a large dataset often generates a huge number of itemsets satisfying the minimum support threshold (min_sup), especially when min_sup is set low".

For this reason, retrieving quality, usable patterns and identifying which patterns are interesting and which are not is the prime goal. However, the measures mentioned above have the following difficulties.

Objective measures rely on a user’s ability to choose the right measure out of a huge set of available ones. Tan et al. [43] showed empirically that some measures produce similar rankings while others almost reverse the order, which poses the problem of choosing the right measure for a given problem or application. Most measures also lack interpretability and meaningfulness, because the rule properties they measure rarely reflect the practical considerations of a user. For a user it is unclear which measure to choose and how to link its results to the application, so many rules deemed interesting will not be very useful. In addition, objective measures are unable to distinguish newly discovered patterns from previously discovered ones; this ability is crucial for separating novel patterns from prevalent ones, which often represent domain knowledge and are thus of less interest.

Subjective interestingness measures are based on user beliefs about the data. A lot of effort is necessary to collect, organize and finally incorporate domain knowledge into a knowledge base, and building a knowledge base can become a task that is never finished. During the knowledge acquisition process, domain knowledge may become outdated or invalid, lose its relevance, or be superseded by new knowledge. Users almost always have only partial awareness of this knowledge ageing process, and because of these knowledge dynamics it is often difficult to obtain a complete knowledge base. Moreover, subjective approaches treat domain knowledge as something static that never changes; they do not account for the ageing of knowledge, nor do they support the user in maintaining it. Consequently, there is a risk that patterns are judged interesting on the basis of outdated knowledge while the user is left uninformed about the outdatedness itself.

One of the major issues facing frequent pattern mining is that it can (and often does) produce an unmanageable number of patterns. Frequent pattern mining algorithms try to identify all the patterns which occur more frequently than a minimum support threshold in the given datasets; many of these patterns may be redundant, and thus the generated rules may be redundant as well. It is then impossible for a data analyst or any domain expert to manually go over such a large collection of patterns. Indeed, reducing the number of frequent patterns has been a major theme in frequent pattern mining research. Much of this research has concerned itemsets, although the ideas can be generalized to many other pattern types. One general approach has been to mine only patterns that satisfy certain constraints; well-known examples include mining maximal frequent patterns [27], closed frequent patterns [34] and non-derivable itemsets [45, 54]. The last two methods are generally referred to as lossless compression, since the exact frequency of any frequent itemset can be fully recovered; the first is lossy compression, since the exact frequencies cannot be recovered. Xin et al. [53] generalize closed frequent itemsets to discover groups of frequent itemsets: if one itemset is a subset of another and its frequency is very close to the frequency of the superset, the first is said to be covered by the latter. However, the patterns produced by all these methods are still too numerous to be very useful; even the method of Xin et al. [53] easily generates thousands of itemset patterns, and these methods generally do not provide a good representation of the collection of frequent patterns.

For these reasons, generating strong association rules for solving real-world problems is very often restricted by the quality of the rules. However, the quality of the extracted rules has not drawn adequate attention [38, 55, 56, 58]. Measuring the quality of association rules is also difficult, and current methods appear to be unsuitable when multi-level rules (rules whose items come from one taxonomy level, but where the set of rules spans more than one taxonomy level) and cross-level rules (rules whose items come from more than one taxonomy level) are involved. Association rules generated from mining data at multiple levels of abstraction are called multi/cross-level or multilevel association rules. Multilevel association rules can be mined using concept hierarchies under a support-confidence framework. Knowledge that could not be found by a single-level approach can be discovered with multi/cross-level mining, and this new knowledge may be highly relevant or interesting to a given user [5, 54]. In fact, multi-level rule mining is useful in discovering new knowledge that is missed by conventional algorithms [55].

Multi/cross-level association rules can be mined under the support-confidence framework using uniform support, reduced support or group-based support. With uniform minimum support, the same support threshold is used when mining at each level of abstraction. The uniform support approach has some difficulties: items at lower levels of abstraction are unlikely to occur as frequently as those at higher levels of abstraction, so if the minimum support threshold is set too high, meaningful associations occurring at low abstraction levels may be missed, while if the threshold is set too low, many uninteresting associations occurring at high abstraction levels may be generated. This provides the motivation for reduced support.

With reduced support, each level of abstraction has its own minimum support threshold; at the lowest level of abstraction, the threshold is very low. Such a low threshold may generate a large number of rules, many of which may be redundant. This motivates group-based support.

With group-based support, users can express which groups of items are more important than others. Thus, it is desirable to set user-specific, item- or group-based minimum support thresholds when mining multilevel rules.
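
A minimal sketch of these support-setting strategies, assuming a hypothetical two-level item hierarchy and invented supports and thresholds (none of the names or numbers come from any real dataset):

```python
# Hypothetical concept hierarchy: each low-level item belongs to a higher-level group.
hierarchy = {
    "2% milk": "milk", "skim milk": "milk",
    "wheat bread": "bread", "white bread": "bread",
}

# Uniform support would use a single threshold everywhere; reduced support lowers
# the threshold at lower levels; group-based support lets users single out groups.
level_sup = {1: 0.25, 2: 0.10}    # reduced support per abstraction level
group_sup = {"milk": 0.15}        # user-specified override for one important group

# Observed supports for candidate items at both levels (illustrative numbers only).
observed = {"milk": 0.40, "bread": 0.30, "2% milk": 0.18, "white bread": 0.08}

for item, sup in observed.items():
    level = 2 if item in hierarchy else 1        # low-level items appear in the hierarchy map
    group = hierarchy.get(item, item)
    threshold = group_sup.get(group, level_sup[level])
    print(f"{item}: support={sup}, threshold={threshold}, "
          f"{'kept' if sup >= threshold else 'pruned'}")
```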

A serious side effect of mining multilevel association rules is the generation of many redundant rules across multiple levels of abstraction due to the ancestor relationships among items.

Consequently, most successful applications are restricted to cases where the dataset involves only a single concept level, and the success of an application is heavily dependent on the quality of the discovered rule set. Mining quality, non-redundant single- and multi-level association rules from single- and multi-level datasets is a challenge that still needs to be addressed, and achieving this goal in order to help solve real-world problems is the work proposed in this thesis.

MOTIVATION

In this section, motivations for choosing the problem as a thesis project are explained.

Association rule mining is a very important task in data mining; it is a summarization technique. The existing methodologies for discovering association rules focus on generating the rules or rule set. For a large dataset, a huge number of rules can be derived, but many of them may be redundant with respect to other rules and thus useless in practice. The extremely large number of rules makes it difficult for end users to comprehend, and therefore effectively use, the discovered rules, which significantly reduces the effectiveness of rule mining algorithms. If the extracted knowledge cannot be effectively used in solving real-world problems, the effort of extracting it is worth little.

Furthermore, traditional approaches deal with single-level association rule mining, and almost all existing work relates to it; there is little work on multi-level association rule mining. When data is spread over many hierarchical levels, the discovered rules also span two or more levels. For such multi-level datasets, the issue of deriving rules is even more serious: the presence of multiple concept levels means that items and patterns appear at different levels, which increases the number of rules that can be discovered, many of which may be redundant.

For these reasons, the proposed work aims to make a contribution in the field of association rule mining by discovering efficient, quality and useful rules in which redundant rules are avoided, so that the extracted knowledge can be used efficiently to solve real-world problems. The proposed work also considers the discovery of association rules from multi-level datasets, so that the discovered rules are more specific at different levels of the data and the extracted knowledge can be applied in many real-world problems. This study has investigated and developed effective methods for mining non-redundant association rules from both single-level and multi-level datasets.

Even if rules can be efficiently and effectively discovered, it is desirable that they are also of high quality. Poor-quality rules lead to poor-quality decisions and outcomes, which is detrimental for everyone involved. Determining the quality of association rules can be difficult because the measure of quality can be either objective or subjective.

In general, the most common techniques used to evaluate the performance of association rule mining algorithms are based on the number of rules discovered and their associated support and confidence values. However, Han and Kamber [21] argue that not all ‘strong’ rules, i.e. those that satisfy the minimum support and confidence thresholds, are interesting [50]. This means the quality or interestingness of the discovered rules can be questionable.


