Mining Methodology And User Interaction


Although OLAP tools support multidimensional analysis and decision making, additional data analysis tools are required for in-depth analysis, such as data classification, clustering, and the characterization of data changes over time. In addition, huge volumes of data can be accumulated beyond databases and data warehouses. Typical examples include the World Wide Web and data streams, where data flow in and out like streams, as in applications such as video surveillance, telecommunication, and sensor networks. The effective and efficient analysis of data in such different forms becomes a challenging task.

The abundance of data, coupled with the need for powerful data analysis tools, has been described as a data-rich but information-poor situation. The fast-growing, tremendous amount of data, collected and stored in large and numerous data repositories, has far exceeded our human ability for comprehension without powerful tools. As a result, data collected in large data repositories become "data tombs": data archives that are seldom visited. Consequently, important decisions are often made based not on the information-rich data stored in data repositories, but rather on a decision maker's intuition, simply because the decision maker does not have the tools to extract the valuable knowledge embedded in the vast amounts of data.

In addition, consider expert system technologies, which typically rely on users or domain experts to manually input knowledge into knowledge bases. Unfortunately, this procedure is prone to biases and errors, and is extremely time-consuming and costly. Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies, knowledge bases, and scientific and medical research. The widening gap between data and information calls for a systematic development of data mining tools that will turn data tombs into "golden nuggets" of knowledge.


Data mining discovers information that is "hidden" in the data: associations (e.g., linking the purchase of pizza with beer), sequences (e.g., tying events together, such as marriage and the purchase of furniture), classifications (e.g., recognizing patterns such as the attributes of employees most likely to quit), and forecasting (e.g., predicting the buying habits of customers based on past patterns); such tasks were previously handled by expert systems or small ML/statistical programs.

Why Data Mining? — Potential Applications

Direct Marketing

Identify which prospects should be included in a mailing list

Market Segmentation

Identify common characteristics of customers who buy the same products

Market Basket Analysis

Identify which products are likely to be bought together

Insurance Claims Analysis

Discover patterns of fraudulent transactions and compare current transactions against those patterns

What can data mining do?

Classification

– Classify credit applicants as low, medium, high risk

– Classify insurance claims as normal, suspicious

Estimation

– Estimate the probability of a direct mailing response

– Estimate the lifetime value of a customer

Prediction

– Predict which customers will leave within six months

– Predict the size of the balance that will be transferred by a credit card prospect

Association

– Find out which items customers are likely to buy together

– Find out what books to recommend to Amazon.com users

Clustering

– Difference from classification: classes are unknown!

Data Mining: On What Kind of Data?

Relational databases

Data warehouses

Transactional databases

Advanced DB and information repositories

Object-oriented and object-relational databases

Spatial databases

Time-series data and temporal data

Text databases and multimedia databases

Heterogeneous and legacy databases

WWW

In principle, data mining should be applicable to any kind of data repository, as well as to transient data such as data streams. Data repositories include relational databases, data warehouses, transactional databases, advanced database systems, flat files, data streams, and the World Wide Web. Advanced database systems include object-relational databases and specific application-oriented databases, such as spatial databases, time-series databases, text databases, and multimedia databases. The challenges and techniques of mining may differ for each of these repository systems.

A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. The software programs involve mechanisms for the definition of database structures; for data storage; for concurrent, shared, or distributed data access; and for ensuring the consistency and security of the information stored, despite system crashes or attempts at unauthorized access. A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values. A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. An ER data model represents the database as a set of entities and their relationships.

Transactional Databases In general, a transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction (such as items purchased in a store). The transactional database may have additional tables associated with it, which contain other information regarding the sale, such as the date of the transaction, the customer ID number, the ID number of the salesperson and of the branch at which the sale occurred, and so on.

Major Issues in Data Mining (1)

Mining methodology and user interaction

Mining different kinds of knowledge in databases

Interactive mining of knowledge at multiple levels of abstraction

Incorporation of background knowledge

Data mining query languages and ad-hoc data mining

Expression and visualization of data mining results

Handling noise and incomplete data

Pattern evaluation: the interestingness problem

Performance and scalability

Efficiency and scalability of data mining algorithms

Parallel, distributed and incremental mining methods

Major Issues in Data Mining (2)

Issues relating to the diversity of data types

Handling relational and complex types of data

Mining information from heterogeneous databases and global information systems (WWW)

Issues related to applications and social impacts

Application of discovered knowledge

Domain-specific data mining tools

Intelligent query answering

Process control and decision making

Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem

Protection of data security, integrity, and privacy

Multi-Dimensional View of Data Mining

Data to be mined

Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW

Knowledge to be mined

Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.

Multiple/integrated functions and mining at multiple levels

Techniques utilized

Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.

Applications adapted

Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.

Data Mining and Business Intelligence:

Association Rules:

The purchasing of one product when another product is purchased represents an association rule. Association rules are used by retail stores to assist in marketing, advertising, floor placement, and inventory control. They are frequently used to show the relationships between data items and to detect common usage of items. A database in which an association rule is to be found is viewed as a set of tuples, where each tuple contains a set of items. The support of an item or set of items is the percentage of transactions in which that item occurs. Given a target domain, the underlying set of items is usually known, so an encoding of the transactions can be performed before processing. Association rules can also be applied to data domains other than categorical data.

Why Is Association Mining Important?

Foundation for many essential data mining tasks

Association, correlation, causality

Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association

Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)

Broad applications

Basket data analysis, cross-marketing, catalog design, sale campaign analysis

Web log (click stream) analysis, DNA sequence analysis, etc.

Association Rule Discovery Definition:

Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that will predict the occurrence of an item based on the occurrences of other items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:

{Milk} -> {Coke}

{Diaper, Milk} -> {Beer}
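To make the support and confidence calculations concrete, here is a minimal Python sketch (illustrative only; the transactions mirror the table above):

# Minimal sketch: computing support and confidence for the association
# rules discovered above (transactions copied from the example table).

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    # support(antecedent together with consequent) / support(antecedent)
    return support(antecedent | consequent) / support(antecedent)

print(support({"Diaper", "Milk", "Beer"}))       # 0.4 (2 of 5 transactions)
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 0.666... (2 of 3)

Support here is the percentage of transactions containing the itemset, exactly as defined above; confidence measures how often the rule's consequent appears among the transactions that contain its antecedent.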

Inventory Management:

Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep its service vehicles equipped with the right parts, to reduce the number of visits to consumer households.

Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

Data Preprocessing

Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. Data transformations, such as normalization, may be applied; for example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements. Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance.

These techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format. Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and/or the time required for the actual mining.

We introduce the basic concepts of data preprocessing and subsequently descriptive data summarization, which serves as a foundation for data preprocessing. Descriptive data summarization helps us study the general characteristics of the data and identify the presence of noise or outliers, which is useful for successful data cleaning and data integration. The methods for data preprocessing are organized into the following categories: data cleaning, data integration and transformation, and data reduction. Concept hierarchies can be used in an alternative form of data reduction, where we replace low-level data (such as raw values for age) with higher-level concepts (such as youth, middle-aged, or senior). This form of data reduction is discussed alongside the automatic generation of concept hierarchies from numerical data using data discretization techniques; the automatic generation of concept hierarchies from categorical data is also described.
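As a small illustration of the transformation and reduction steps just described, the following Python sketch applies min-max normalization and a concept-hierarchy-style replacement of raw ages with higher-level concepts; the age values and cut-off points are invented for illustration:

# Sketch of two preprocessing steps on toy data. The age cut-offs below
# are illustrative assumptions, not fixed standards.

ages = [13, 15, 16, 19, 20, 35, 40, 52, 70]

# Min-max normalization: rescale values into [0, 1] so attributes measured
# on different scales contribute comparably to distance-based mining.
lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) for a in ages]

# Concept-hierarchy style data reduction: replace low-level values (raw
# ages) with higher-level concepts (youth, middle-aged, senior).
def age_concept(age):
    if age < 30:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "senior"

concepts = [age_concept(a) for a in ages]
print(normalized)
print(concepts)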

Applications

Credit card fraud detection

Telecommunication fraud detection

Network intrusion detection

Fault detection

and many more

Measures: Three Categories

Distributive:

The result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning. E.g., count(), sum(), min(), max().

Algebraic:

The measure can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function. E.g., avg(), min_N(), standard_deviation().

Holistic:

There is no constant bound on the storage size needed to describe a subaggregate. E.g., median(), mode(), rank().
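The distinction among the three categories becomes clear when the data are partitioned, as in the following Python sketch (toy data): distributive and algebraic measures can be assembled from small per-partition summaries, while a holistic measure such as median() cannot.

# Sketch: the three categories of measures over partitioned toy data.

partitions = [[3, 1, 4], [1, 5, 9], [2, 6]]
all_data = [x for p in partitions for x in p]

# Distributive: the sum of per-partition sums equals the global sum.
assert sum(sum(p) for p in partitions) == sum(all_data)

# Algebraic: avg() is computable from M = 2 distributive measures kept
# per partition, namely sum() and count().
total = sum(sum(p) for p in partitions)
count = sum(len(p) for p in partitions)
avg = total / count

# Holistic: median() needs the full data; no constant-size per-partition
# summary suffices in general.
median = sorted(all_data)[len(all_data) // 2]
print(avg, median)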

Databases are rich with hidden information that can be used for intelligent decision making. Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help provide us with a better understanding of the data at large. Classification predicts categorical (discrete, unordered) labels, whereas prediction models continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation. Many classification and prediction methods have been proposed by researchers in machine learning, pattern recognition, and statistics. Most algorithms are memory resident, typically assuming a small data size. Recent data mining research has built on such work, developing scalable classification and prediction techniques capable of handling large disk-resident data.

Applications of classification

Classification has many applications, such as predicting consumer behavior and identifying fraud.

For example, a credit card company may have a sample of data on past applicants, together with knowledge of which applicants were good credit risks and which were not.

A classification method may use the sample to derive a set of rules for allocating new applications to either of the two classes.

Data classification is a two-step process. In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or "learning from" a training set made up of database tuples and their associated class labels.

A classification process in which classes have been predefined needs a method that will train the classification system to allocate objects to the classes. The training is based on a training sample: a set of sample data where, for each sample, the class is already known. We assume each object has a number of attributes, one of which tells us which class the object belongs to. This attribute is known for the training data, but for data other than the training data (which we call the test data) we assume that its value is unknown and is to be determined by the classification method. The attribute may be considered the output of all the other attributes and is often referred to as the output attribute or the dependent attribute. The attributes other than the output attribute are called the input attributes or the independent attributes.
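A minimal Python sketch of this two-step process, using a nearest-neighbour classifier (a memory-based reasoning method, one of the techniques listed in the next section); the attribute values and class labels are invented for illustration:

# Two-step classification with a 1-nearest-neighbour classifier.
# All data below is invented for illustration.

# Step 1 (learning): training samples whose output attribute (the credit
# risk class) is already known. Input attributes: (income_k, age).
training = [
    ((25, 23), "high"),
    ((40, 30), "medium"),
    ((85, 45), "low"),
    ((60, 50), "low"),
    ((30, 21), "high"),
]

def classify(x):
    # Step 2: assign a test tuple the class of its nearest training tuple.
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    _, label = min(training, key=lambda sample: dist2(sample[0], x))
    return label

print(classify((70, 40)))  # test tuple, output attribute unknown -> "low"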

Classification Techniques

Decision Tree based Methods

Rule-based Methods

Memory based reasoning

Neural Networks

Naïve Bayes and Bayesian Belief Networks

Support Vector Machines

Data prediction: Data prediction is a two-step process, similar to that of data classification. However, for prediction, we lose the terminology of "class label attribute" because the attribute for which values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered). The attribute can be referred to simply as the predicted attribute. Suppose that, in our example, we instead wanted to predict the amount (in dollars) that would be "safe" for the bank to loan an applicant. The data mining task becomes prediction, rather than classification. We would replace the categorical attribute, loan decision, with the continuous-valued loan amount as the predicted attribute, and build a predictor for our task.
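A minimal numeric sketch of such a predictor in Python, fitting a least-squares line that predicts a continuous loan amount from income; all figures are invented for illustration:

# Sketch: predicting a continuous attribute (loan amount, in $k) from a
# single independent attribute (income, in $k) via a least-squares line.
# The numbers are invented toy data.

incomes = [30, 45, 60, 75, 90]
amounts = [12, 20, 27, 33, 42]

n = len(incomes)
mean_x = sum(incomes) / n
mean_y = sum(amounts) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(incomes, amounts))
         / sum((x - mean_x) ** 2 for x in incomes))
intercept = mean_y - slope * mean_x

def predict(income):
    # The predicted attribute is continuous-valued, not a class label.
    return slope * income + intercept

print(predict(50))  # predicted loan amount for an applicant with income 50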

Comparing Classification and Prediction Methods

Classification and prediction methods can be compared and evaluated according to the following criteria:

1) Accuracy: The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data (i.e., tuples without class label information). Similarly, the accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new or previously unseen data.

2) Robustness: This is the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values.

3) Scalability: This refers to the ability to construct the classifier or predictor efficiently given large amounts of data.

4) Interpretability: This refers to the level of understanding and insight that is provided by the classifier or predictor. Interpretability is subjective and therefore more difficult to assess. We discuss some work in this area, such as the extraction of classification rules from a "black box" neural network classifier called back-propagation.

Decision Tree: A decision tree is a popular classification method that results in a flow-chart-like tree structure, in which each internal node denotes a test on an attribute value and each branch represents an outcome of the test. The tree leaves represent the classes.

A decision tree is a model that is both predictive and descriptive. It is a tree that displays relationships found in the training data. The tree consists of zero or more internal nodes and one or more leaf nodes, with each internal node being a decision node having two or more child nodes. Using the training data, the decision tree method generates a tree whose nodes are rules. Each internal node of the tree represents a choice between a number of alternatives, and each leaf node represents a classification or decision. The training process that generates the tree is called induction. Generally, the number of training samples required is likely to be relatively small if the number of independent attributes is small, and large when the number of attributes is large. Normally the complexity of the decision tree increases as the number of attributes increases, although in some situations it has been found that only a small number of attributes determine the class to which an object belongs and the rest of the attributes have little or no impact. The quality of the training data usually plays an important role in determining the quality of the decision tree.
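A minimal sketch of decision tree induction, assuming the scikit-learn library is available; the feature values and class labels below are invented:

# Decision tree induction on invented data; assumes scikit-learn is
# installed (pip install scikit-learn).
from sklearn.tree import DecisionTreeClassifier, export_text

# Training samples: [income_k, age] with known class labels.
X = [[25, 23], [40, 30], [85, 45], [60, 50], [30, 21], [90, 35]]
y = ["high", "medium", "low", "low", "high", "low"]

# Induction: build the tree from the training data.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Each internal node tests an attribute; each leaf carries a class.
print(export_text(tree, feature_names=["income", "age"]))
print(tree.predict([[55, 40]]))  # classify a new, unseen tuple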

Cluster Analysis: Introduction

Imagine that you are given a set of data objects for analysis where, unlike in classification, the class label of each object is not known. This is quite common in large databases, because assigning class labels to a large number of objects can be a very costly process. Clustering is similar to classification in that the data are grouped. However, unlike classification, the groups are not predefined. Instead, the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. Clustering is the process of grouping the data into classes or clusters, so that objects within a cluster have high similarity in comparison to one another but are very dissimilar to objects in other clusters. Dissimilarities are assessed based on the attribute values describing the objects. Often, distance measures are used. Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning.

What Is Cluster Analysis?

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered as a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: First partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups.

Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups. Cluster analysis is an important human activity. Early in childhood, we learn how to distinguish between cats and dogs, or between animals and plants, by continuously improving subconscious clustering schemes. By automated clustering, we can identify dense and sparse regions in object space and, therefore, discover overall distribution patterns and interesting correlations among data attributes. Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.

Clustering may also help in the identification of areas of similar land use in an earth observation database and in the identification of groups of houses in a city according to house type, value, and geographic location, as well as the identification of groups of automobile insurance policy holders with a high average claim cost. It can also be used to help classify documents on the Web for information discovery. Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, where outliers (values that are "far away" from any cluster) may be more interesting than common cases. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce. For example, exceptional cases in credit card transactions, such as very expensive and frequent purchases, may be of interest as possible fraudulent activity.

As a data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features.

Data clustering is under vigorous development. Contributing areas of research include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Owing to the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research. As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning, clustering is an example of unsupervised learning. Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples. In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active themes of research focus on the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases. Clustering is a challenging field of research in which its potential applications pose their own special requirements.
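As a compact illustration of distance-based clustering in the style of k-means (mentioned above), here is a pure-Python sketch on invented two-dimensional points:

# k-means sketch on invented 2-D points: alternately assign each point to
# its nearest centroid, then recompute each centroid as its cluster mean.

points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),   # one dense region
          (8.0, 8.0), (8.5, 8.5), (7.8, 9.0)]   # another dense region

def kmeans(points, k=2, iters=10):
    centroids = points[:k]  # naive initialisation, for illustration only
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest centroid wins
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        for i, cl in enumerate(clusters):  # update step: recompute means
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

centroids, clusters = kmeans(points)
print(centroids)  # one centroid per dense region

No class labels are used anywhere: the grouping emerges purely from the distances between points, which is what makes clustering a form of unsupervised learning.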

General Applications of Clustering

Pattern recognition

Spatial data analysis

create thematic maps in GIS by clustering feature spaces

detect spatial clusters and explain them in spatial data mining

Image processing

Economic science (especially market research)

WWW

document classification

cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications

Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: identification of areas of similar land use in an earth observation database

Insurance: identifying groups of motor insurance policy holders with a high average claim cost

City planning: identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: observed earthquake epicenters should be clustered along continent faults

Requirements of Clustering in Data Mining

Scalability

Ability to deal with different types of attributes

Discovery of clusters with arbitrary shape

Minimal requirements for domain knowledge to determine input parameters

Ability to deal with noise and outliers

Insensitivity to the order of input records

High dimensionality

Incorporation of user-specified constraints

Interpretability and usability

What Is Good Clustering?

A good clustering method will produce high-quality clusters with:

high intra-class similarity

low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.


