Mining Uncertain Data Using Decision Tree Computer Science Essay


Miss. Phadatare Manasi M., ME-II Computer Engg., Vidya Pratishthas College of Engineering, Baramati.

Prof. Mrs. S. S. Nandgaonkar, Computer Engg., Vidya Pratishthas College of Engineering, Baramati.

Abstract

Data mining is the science and technology of exploring data in order to discover previously unknown patterns. Classification is necessary for the efficient use of the abundant information stored in databases. In many applications, such as location-based services, sensor monitoring, image recognition, medical diagnosis, credit rating, scientific experiments, and fraud detection, data is often associated with uncertainty due to inaccurate measurement, sampling discrepancy, and outdated data sources. Decision trees can be used to classify such uncertain data. A decision tree is a predictive model that represents a classifier: it assigns an object or instance to one of a predefined set of classes based on its attribute values. Traditional decision trees use a single point value per attribute to classify an instance. Under uncertainty, the decision tree instead uses the probability density function (PDF) of each attribute for classification. Processing the PDFs of uncertain data requires many more values to be computed during decision tree construction. Bagging and boosting techniques can be used to improve the efficiency of decision tree generation.

Key terms

Classification, Data Mining, Decision Tree, Uncertainty, Machine Learning

1. Introduction

Uncertainty in data is becoming a great challenge in data mining. Data mining is the part of the KDD process in which knowledge is extracted from large volumes of data, and the classification of uncertain data is a major issue within it. Factors that cause uncertainty include measurement and decision errors, unreliable data transmission, and data staleness. Uncertainty arises in categorical data just as it does in numerical data. The decision tree is a very effective classification tool because it can handle both kinds. A decision tree partitions data into smaller segments called terminal nodes, and each terminal node is assigned a class label. The non-terminal nodes, including the root and other internal nodes, contain attribute test conditions that separate records with different characteristics. The partitioning process terminates when the subsets contain homogeneous records and cannot be partitioned any further under the predefined criteria.

This paper studies a decision tree based classification method for uncertain data. In many applications, data contains inherent uncertainty; since data uncertainty is ubiquitous, it is important to develop classification models for uncertain data. We focus on the decision tree based approach because of its numerous positive features. A decision tree is simple to understand and interpret. It can handle both numerical and categorical data, whereas many other techniques are usually limited to datasets that are mostly numerical. A decision tree uses a white-box model, and it is possible to validate a decision tree model using statistical tests. Decision trees are also robust and scalable: they perform well on large data in a short period of time. The main contributions of this paper are:

1) to integrate the uncertainty data model into the design of the decision tree;

2) to build decision trees for uncertain data using the Distribution-based approach and the Averaging approach, and to compare the two approaches on the basis of accuracy;

3) to check whether techniques such as bagging and boosting can lead to faster decision tree construction and higher accuracy.

This paper is organized as follows. In the next section, we discuss related work. Section 3 describes the programmer's design for the uncertain data model; it includes the mathematical model, Dynamic Programming and Serialization, Data Independence and Data Flow Architecture, Multiplexer Logic, and the Turing Machine. Section 4 contains the experimental results and important discussions. Finally, Section 5 concludes the paper.

2. Related Work

Classification is one of the classical data mining problems found in real life. Many classification algorithms have been proposed in the literature, such as decision tree classifiers, Bayesian classifiers, support vector machines, artificial neural networks, and ensemble methods. Many recent data collection techniques produce uncertain data. While many applications lead to data that contains errors, we refer to uncertain data sets as those in which the level of uncertainty can be quantified in some way. Classifying such uncertain data remains a great challenge.


Previous work has addressed decision tree construction when data contains missing or noisy values. Decision tree classification with missing data has been studied for decades. Missing values appear when some attribute values are not available or wrong entries are made during data collection. One solution is to infer the missing values (exact or probabilistic) using a classifier on the attribute (ordered attribute trees and probabilistic attribute trees). In this paper, instead of assuming that part of the data is missing, we allow the whole dataset to be uncertain, represented by uncertain intervals and probability distribution functions rather than by erroneous point values.

When clustering uncertain data [7][8][9][10], the probability distributions of objects are used to compute the expected distance between two uncertain objects. UK-means, an extension of the well-known k-means algorithm, is used to cluster uncertain objects.

In fuzzy decision tree classification [13], both attributes and class labels can be fuzzy. Given a fuzzy attribute of a data tuple, a degree is assigned to each possible value, showing the extent to which the data tuple belongs to that value.

The cloud model [10] has been presented to mine spatial data with both randomness and fuzziness. It can also act as an uncertainty transition for quantitative data, which is the basis of spatial data mining in the context of uncertainty. A cloud model is a mathematical model of the uncertainty transition between a linguistic term of a qualitative concept and its numerical representation.

Instead of giving a single class as the classification result for each test tuple, our work gives a distribution of how likely the tuple belongs to each class. Building a decision tree even on tuples with numerical, point-valued data is computationally demanding, since finding the best split point is expensive. To improve efficiency, many techniques have been proposed to reduce the number of candidate split points; our work can be considered an extension of these optimization techniques.

2.1. Proposed System

We construct a classification model that classifies each tuple under uncertainty using a decision tree classifier. This involves finding a good testing attribute and a good split point for each internal node, including the root node, as well as an appropriate probability distribution over the class labels to handle the uncertainty. After selecting an appropriate testing attribute, we apply the Averaging and Distribution-based approaches to the selected attribute to classify each tuple correctly. Techniques such as pruning are then applied to improve accuracy.

3. Programmer’s design

3.1. Mathematical Model

1. Let DTU be a system that describes the decision tree classifier for uncertain data: DTU = { }.

2. Identify the input S: DTU = {S}, where S = {te, tr} is the set of testing and training tuples.

3. Identify the output O: DTU = {S, O}, where O = {(te_i, c_i)}; each test tuple is assigned a class label.

4. Identify the process M: DTU = {S, O, M}, where M = {H(z, A_j), p_L, p_R, f^L_{jn}(x)}.

5. H(z, A_j) is the entropy used to compute the information gain for attribute selection:

   H(z, A_j) = \sum_{X \in \{L, R\}} \frac{|X|}{|S|} \left( \sum_{c \in C} -p_c \log_2 p_c \right)   (1)

   where p_c is the proportion of (fractional) tuples in subset X that belong to class c.

6. p_L is the probability that a tuple falls in the left subtree:

   p_L = \int_{a_{x,jn}}^{z_n} f_{x,jn}(t) \, dt

7. p_R is the probability that a tuple falls in the right subtree: p_R = 1 - p_L.

8. f^L_{jn}(x) is the PDF of attribute A_{jn} restricted to the left branch, where w_L is the probability mass of the PDF on [a_{x,jn}, z_n]:

   f^L_{jn}(x) = \begin{cases} f_{x,jn}(x)/w_L, & \text{if } x \in [a_{x,jn}, z_n] \\ 0, & \text{otherwise} \end{cases}   (2)

9. Identify the failure condition F: DTU = {S, O, M, F}; failure occurs if tuples are not correctly classified.
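As an illustration only (not code from the paper), the following Python sketch evaluates equation (1) for one candidate split point z_n, assuming each tuple's PDF is approximated by discrete (value, probability) sample points; the function names and data layout are hypothetical.

```python
import math
from collections import defaultdict

def split_probability(samples, zn):
    """Fraction of a tuple's PDF mass at or below split point zn.

    `samples` is a list of (value, probability) pairs approximating the
    PDF f_{x,jn}; the probabilities are assumed to sum to 1.
    """
    return sum(p for x, p in samples if x <= zn)

def split_entropy(tuples, zn):
    """Entropy H(z, A_j) of a candidate split, in the spirit of equation (1).

    `tuples` is a list of (samples, class_label, weight) triples. Each tuple
    contributes weight * p_L to the left subset and weight * p_R to the right.
    """
    side_class_w = {"L": defaultdict(float), "R": defaultdict(float)}
    for samples, label, w in tuples:
        p_left = split_probability(samples, zn)
        side_class_w["L"][label] += w * p_left
        side_class_w["R"][label] += w * (1.0 - p_left)

    total = sum(sum(d.values()) for d in side_class_w.values())
    h = 0.0
    for d in side_class_w.values():
        subset = sum(d.values())
        if subset == 0:
            continue
        inner = -sum((wc / subset) * math.log2(wc / subset)
                     for wc in d.values() if wc > 0)
        h += (subset / total) * inner
    return h
```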

Input:


Figure 1: Input Dataset

Output:

a) Averaging:

Figure 2: Decision Tree : Averaging Approach

b) Distribution Based:

Figure 3: Decision Tree: Distribution Approach

3.2. Dynamic Programming and Serialization

Decision Tree Induction Algorithm:

input: the training dataset D; the set of candidate attributes att-list
output: an uncertain decision tree

begin
1.  create a node N;
2.  if (all tuples in D are of the same class C) then
        return N as a leaf node labeled with class C;
3.  else if (att-list is empty) then
        return N as a leaf node labeled with the highest-weight class in D;
    end if;
4.  select the test-attribute with the highest probabilistic information gain ratio to label node N;
5.  if (test-attribute is numeric or uncertain numeric) then
        binary-split the data at the selected position z_n;
6.      for (each instance t_i) do
7.          if (test-attribute <= z_n) then
                put it into D_l with weight t_i.w;
8.          else if (test-attribute > z_n) then
                put it into D_r with weight t_i.w;
9.          else
                put it into D_l with weight t_i.w * \int_{a_{i,jn}}^{z_n} f(x) dx;
                put it into D_r with weight t_i.w * \int_{z_n}^{b_{i,jn}} f(x) dx;
            end if;
        end for;
10. else
11.     for (each value a_i (i = 1, ..., n) of the attribute) do
            grow a branch D_i for it;
        end for;
12.     for (each instance t_i) do
13.         if (test-attribute is uncertain) then
                put it into each D_i with weight t_i.w multiplied by the probability that the attribute takes value a_i;
14.         else
                put it into the matching D_i with weight t_i.w;
            end if;
        end for;
    end if;
15. for (each D_i) do
        attach the node returned by DTU(D_i, att-list);
    end for;
end
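The fractional splitting in steps 6-9 can be sketched as follows. This is an assumed Python illustration, not the paper's implementation: each instance's PDF is represented by (value, probability) samples and a weight field playing the role of t_i.w.

```python
def route_instances(instances, zn):
    """Split weighted uncertain instances at split point zn (steps 6-9).

    `instances` is a list of dicts with keys:
      'samples': list of (value, probability) pairs for the tested attribute,
      'label'  : class label,
      'weight' : accumulated tuple weight t_i.w.
    Returns (D_l, D_r), each a list of instances with adjusted weights.
    """
    d_left, d_right = [], []
    for inst in instances:
        lo = min(x for x, _ in inst['samples'])
        hi = max(x for x, _ in inst['samples'])
        if hi <= zn:                       # step 7: PDF entirely left of zn
            d_left.append(dict(inst))
        elif lo > zn:                      # step 8: PDF entirely right of zn
            d_right.append(dict(inst))
        else:                              # step 9: PDF straddles zn -> split
            p_left = sum(p for x, p in inst['samples'] if x <= zn)
            left = dict(inst, weight=inst['weight'] * p_left)
            right = dict(inst, weight=inst['weight'] * (1.0 - p_left))
            if left['weight'] > 0:
                d_left.append(left)
            if right['weight'] > 0:
                d_right.append(right)
    return d_left, d_right
```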


a) Averaging Approach:

The algorithm starts at the root node with S, the set of all training tuples. At each node n, we first check whether all the tuples in S have the same class label c. If so, we make n a leaf node and set P_n(c) = 1 and P_n(c') = 0 for all c' != c. Otherwise, we select an attribute A_jn and a split point z_n and divide the tuples into left and right subsets: tuples with v_{i,jn} <= z_n are put in the left subset and the rest go to the right subset.

If either the left or the right subset is empty, the attribute cannot be used further for classifying the tuples in S, and we make n a leaf node. If neither subset is empty, we make n an internal node, create child nodes for it, and recursively invoke the algorithm on the left child and the right child, passing to them the left and right subsets, respectively.
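As a rough illustration of this recursion (an assumption, not the authors' code), the sketch below builds a tree on point-valued data for a single numeric attribute, using a naive midpoint split instead of the entropy-based selection described above.

```python
from collections import Counter

def build_tree(data, depth=0, max_depth=5):
    """data: list of (value, label) pairs for one numeric attribute."""
    labels = [label for _, label in data]
    # Leaf: all tuples share one class, or no further split is allowed.
    if len(set(labels)) == 1 or depth >= max_depth:
        majority = Counter(labels).most_common(1)[0][0]
        return {'leaf': True, 'class': majority}

    values = sorted(v for v, _ in data)
    zn = (values[len(values) // 2 - 1] + values[len(values) // 2]) / 2.0
    left = [(v, c) for v, c in data if v <= zn]
    right = [(v, c) for v, c in data if v > zn]

    # If the split separates nothing, stop and make a leaf.
    if not left or not right:
        majority = Counter(labels).most_common(1)[0][0]
        return {'leaf': True, 'class': majority}

    return {'leaf': False, 'split': zn,
            'left': build_tree(left, depth + 1, max_depth),
            'right': build_tree(right, depth + 1, max_depth)}
```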

b) Distribution-Based Approach:

For uncertain data, we adopt the same decision tree building framework as described for point data. After an attribute A_jn and a split point z_n have been chosen for a node n, we have to split the set of tuples S into two subsets L and R. The major difference from the point data case lies in the way S is split. The PDF of a tuple t_i in S under attribute A_jn spans the interval [a_{i,jn}, b_{i,jn}]. If b_{i,jn} <= z_n, the PDF of t_i lies completely to the left of the split point, and thus t_i is assigned to L. Similarly, we assign t_i to R if z_n < a_{i,jn}. If the PDF properly contains the split point, i.e., a_{i,jn} <= z_n < b_{i,jn}, we split t_i into two fractional tuples t_L and t_R and add them to L and R, respectively.

This illustrates that by considering probability distributions rather than just expected values there are more choices of split points, so we can potentially build a more accurate decision tree.
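A hedged sketch of this fractional split for a single tuple is given below; it assumes the PDF is stored as (value, probability) samples and renormalizes each fragment by its mass w_L or w_R, in the spirit of equation (2).

```python
def split_tuple(samples, weight, zn):
    """Split one uncertain tuple into fractional tuples t_L and t_R at zn.

    Returns (t_L, t_R); either may be None if its probability mass is zero.
    """
    left_samples = [(x, p) for x, p in samples if x <= zn]
    right_samples = [(x, p) for x, p in samples if x > zn]
    w_left = sum(p for _, p in left_samples)
    w_right = sum(p for _, p in right_samples)

    t_left = None
    if w_left > 0:
        t_left = {'weight': weight * w_left,
                  'samples': [(x, p / w_left) for x, p in left_samples]}
    t_right = None
    if w_right > 0:
        t_right = {'weight': weight * w_right,
                   'samples': [(x, p / w_right) for x, p in right_samples]}
    return t_left, t_right
```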

3.3. Data independence and Data Flow architecture

The system architecture for the Decision Tree Classifier is as follows. The classifier works in three steps:
1) Selection of attributes
2) Generating the decision tree
3) Pruning

1. Attribute Selection:
To build a good decision tree, the choice of attribute and split point is crucial: the best choice of attribute and split point minimizes the degree of uncertainty. The degree of uncertainty can be measured in many ways, such as entropy, gain ratio, and Gini index. In this paper we concentrate on entropy and gain ratio, as given in equation (1).

Figure 4: DO: attribute selection

Figure 5: Decision Tree Classifier
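To make the selection step concrete, here is a minimal, assumed sketch (not from the paper) that scans every attribute and every candidate split point of a small point-valued dataset and returns the split with the lowest weighted entropy.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels, n_attributes):
    """rows[i] is a list of point values, one per attribute."""
    best = None  # (weighted_entropy, attribute_index, split_point)
    for j in range(n_attributes):
        values = sorted(set(row[j] for row in rows))
        # Candidate split points lie between consecutive attribute values.
        for lo, hi in zip(values, values[1:]):
            zn = (lo + hi) / 2.0
            left = [c for row, c in zip(rows, labels) if row[j] <= zn]
            right = [c for row, c in zip(rows, labels) if row[j] > zn]
            h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
            if best is None or h < best[0]:
                best = (h, j, zn)
    return best

# Example: two attributes, four training tuples.
rows = [[1.0, 5.0], [2.0, 6.0], [8.0, 5.5], [9.0, 6.5]]
labels = ['a', 'a', 'b', 'b']
print(best_split(rows, labels, n_attributes=2))  # splits attribute 0 at 5.0
```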

2. Generating Decision Tree:

There are two approaches for handling uncertain data while generating the decision tree. The first approach, called Averaging, transforms an uncertain data set into a point-valued one by replacing each PDF with its mean value. The second approach, Distribution-based, exploits the full information by considering all sample points that constitute each PDF. The challenge here is that a training tuple can now pass a test at a tree node probabilistically when its PDF properly contains the split point of the test. Many algorithms, such as ID3 and C4.5, have been devised for decision tree construction.

A) Averaging Approach:

The first approach, called Averaging, transforms an uncertain data set into a point-valued one by replacing each PDF with its mean value. More specifically, for each tuple t_i and attribute A_j, we take the mean value

v_{i,j} = \int_{a_{i,j}}^{b_{i,j}} x f_{i,j}(x) \, dx

as its representative value. The feature vector of t_i is thus transformed to (v_{i,1}, ..., v_{i,k}). A decision tree can then be built by applying a traditional tree construction algorithm.
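A tiny illustrative helper (assumed, not the paper's code) for this transformation, with each PDF given as discrete (value, probability) samples, could look like this:

```python
def to_point_valued(uncertain_tuple):
    """uncertain_tuple maps attribute name -> list of (value, probability).

    Collapses each attribute's PDF to its mean value v_{i,j}.
    """
    return {attr: sum(x * p for x, p in samples)
            for attr, samples in uncertain_tuple.items()}

# Example: a tuple with one uncertain attribute sampled at three points.
t = {'temperature': [(20.0, 0.25), (22.0, 0.5), (24.0, 0.25)]}
print(to_point_valued(t))  # {'temperature': 22.0}
```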

B) Distribution-Based Approach:

To extract the full information carried by the PDFs, our Distribution-based approach considers all the sample points that constitute each PDF. The challenge here is that a training tuple can now pass a test at a tree node probabilistically when its PDF properly contains the split point of the test. Also, a slight change of the split point modifies that probability, potentially altering the tree structure.

3) Pruning:

The dictionary meaning of pruning is removing unwanted branches of a tree. Pruning is a way of reducing the size of the decision tree. It reduces accuracy on the training data but increases accuracy on unseen (uncertain) data. It is used to minimize the effect of overfitting, where perfect accuracy is achieved on the training data but the learned model (i.e., the decision tree) is so specific that it does not apply to anything but that training data. In general, increasing pruning lowers the accuracy on the training set.

The Global Pruning (GP) algorithm is very effective in pruning intervals; GP reduces the number of entropy calculations to only 2.7 percent of that of the Distribution-based decision tree.

We prune away only candidate split points that give suboptimal entropy values, that is, split points that produce empty subsets. So, even after pruning, we still find optimal split points; the pruning algorithms do not affect the resulting decision tree, they only speed up the tree building process. The following figure shows the decision tree of Figure 3 after pruning.

There are various methods of pruning: 1) pruning empty and homogeneous intervals, 2) pruning by bounding, and 3) end-point sampling. Here we consider pruning empty and homogeneous intervals and end-point sampling. An interval is empty if no PDFs intersect it; an interval is homogeneous if all the PDFs that intersect it come from tuples of the same class. Candidate split points inside such intervals can be pruned without losing an optimal split point.

In the bounding method we attempt to prune heterogeneous intervals. For each heterogeneous interval we calculate a lower bound of the entropy and compare it with a pruning threshold. If the lower bound is not smaller than the pruning threshold, no candidate split point within the interval can improve on the best split found so far, and the whole interval is pruned.

In end-point sampling, instead of taking all split points, we take a sample of the end points (say 20 percent) and use the entropy at these split points to calculate the pruning threshold. This leads to far fewer entropy calculations.
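The interval-based pruning idea can be sketched as follows; this is an assumed illustration in which each tuple's PDF spans an interval [a_i, b_i], and only heterogeneous intervals keep their interior candidate split points.

```python
def prune_intervals(tuples):
    """tuples: list of (a, b, label). Returns list of (lo, hi, keep_interior)."""
    ends = sorted({e for a, b, _ in tuples for e in (a, b)})
    result = []
    for lo, hi in zip(ends, ends[1:]):
        # Classes of all PDFs that overlap the open interval (lo, hi).
        classes = {label for a, b, label in tuples if a < hi and b > lo}
        empty = len(classes) == 0
        homogeneous = len(classes) == 1
        result.append((lo, hi, not (empty or homogeneous)))
    return result

# Example: two classes whose PDF intervals overlap only in the middle.
tuples = [(0.0, 4.0, 'a'), (1.0, 5.0, 'a'), (3.0, 8.0, 'b'), (6.0, 9.0, 'b')]
for lo, hi, keep in prune_intervals(tuples):
    print(f"({lo}, {hi}) -> {'evaluate interior' if keep else 'pruned'}")
```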

Figure 6: Decision Tree : after pruning tree in figure 3

3.4. Multiplexer Logic

A set of training tuples and a set of attributes are given as input to the system. To handle uncertainty in the data, the PDF of a tuple for a particular attribute is considered around the split point instead of a single value. The decision tree classifier is built from the training tuples; the remaining tuples in the data set are test tuples, which are assigned class labels by the classifier. This procedure repeats until all test tuples are classified. Pruning is then applied to the resulting tree to reduce its size.

We tried to cover a spectrum of dataset properties such as size, number and types of attributes, number of classes, and class distributions. Based on the ID3 and C4.5 implementations in Weka, we construct the decision tree classifier. The experiments are executed on a PC with an Intel Core i3 2.53 GHz CPU and 3.0 GB of main memory.

3.5. Turing Machine

Figure 7: State Transition Diagram

State transitions:
0 - 1: read the data set.
1 - 2: take the training tuples.
2 - 3: select a proper attribute.
3 - 4: apply the Averaging approach.
3 - 5: apply the Distribution-based approach.
4, 5 - 6: classify the tuples.
6 - 2: if not all test tuples are classified, return to attribute selection.
6 - 7: apply pruning to the decision tree.
7 - 8: return the pruned decision tree.

4. Results and Discussion

4.1. Results

The expected results when the decision tree classifier is applied to the data are shown in the figures below. After applying the ID3 and C4.5 learning algorithms to our dataset, we obtain the classified test tuples; the results are displayed as a decision tree together with a graph showing the accuracy achieved on different data sets. To explore the potential of achieving higher classification accuracy by considering data uncertainty, the Averaging and Distribution-based approaches are applied to real data sets taken from the UCI Machine Learning Repository. These data sets are chosen because they contain mostly numerical attributes obtained from measurements. For the purposes of our experiments, classifiers are built on the numerical attributes and their class label attributes.

Figure 8: Data from UCI Machine Learning Repository

4.2. Discussions

Uncertainty Model

The uncertainty models of the attributes are assumed to be known by some external means; in practice, finding a good model is an application-dependent endeavour. Using uncertainty information modelled by PDFs can help us construct more accurate classifiers, but it is assumed that data collectors provide the complete raw data rather than a few aggregate values.

Figure 9: Result: Classification of data [glass dataset]

Figure 10: Result: Extent of uncertainty

Handling categorical data

As with their numerical counterparts, uncertainty can arise in categorical attributes due to ambiguities, data staleness, and repeated measurements. As a heuristic, a categorical attribute that has already been chosen for splitting in an ancestor node of the tree need not be reconsidered, because splitting the tuples in question on that attribute again gives no information gain.

ID3-

Iterative Dichotomiser 3 (ID3) is a simple decision tree learning algorithm developed by J. Ross Quinlan (1986). ID3 is a greedy approach: it constructs the decision tree top-down, testing an attribute at every node on the given training data. It uses a statistical property called information gain to select which attribute to test at each node; the attribute with the maximum information gain is chosen because it best separates the tuples.


C4.5-

C4.5 is an enhancement of the ID3 algorithm. It is a decision tree learning algorithm that uses the information entropy concept: the tree partitions instances from the root node down to the leaf nodes. It uses the information gain ratio for attribute selection, with higher values preferred. Pruning is included in this algorithm, which was not the case in ID3.
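As a concrete illustration of the gain ratio criterion (a generic sketch, not tied to the Weka implementation used in our experiments), consider a binary split: the information gain is divided by the split information, which penalizes splits that scatter tuples into many or unbalanced branches.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, left_labels, right_labels):
    n = len(parent_labels)
    p_l, p_r = len(left_labels) / n, len(right_labels) / n
    gain = entropy(parent_labels) - (p_l * entropy(left_labels)
                                     + p_r * entropy(right_labels))
    split_info = -sum(p * math.log2(p) for p in (p_l, p_r) if p > 0)
    return gain / split_info if split_info > 0 else 0.0

# Example: a split that isolates class 'b' perfectly.
parent = ['a', 'a', 'a', 'b', 'b', 'b']
print(gain_ratio(parent, ['a', 'a', 'a'], ['b', 'b', 'b']))  # 1.0
```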

Bagging -

The bagging method was formulated by Leo Breiman; its name derives from the phrase "bootstrap aggregating". Bagging is used to improve the results of a classification algorithm. The classification algorithm builds a classifier from the training tuples; bagging first creates a sequence of classifiers, each trained on a modified (resampled) training set, and then combines them into a compound classifier. The prediction of the compound classifier is given as a weighted combination of the individual classifier predictions.
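A minimal sketch of this idea is shown below (an assumption, not the paper's implementation). It uses a simple majority vote, with weighted voting being a straightforward extension, and takes the base tree learner and classifier as parameters so no particular library is assumed.

```python
import random
from collections import Counter

def bagging(train_set, train_tree, classify, n_models=10, seed=0):
    """Train n_models trees on bootstrap samples and vote on predictions.

    `train_tree(sample)` builds a tree; `classify(tree, instance)` predicts.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: draw |train_set| tuples with replacement.
        sample = [rng.choice(train_set) for _ in train_set]
        models.append(train_tree(sample))

    def predict(instance):
        votes = Counter(classify(m, instance) for m in models)
        return votes.most_common(1)[0][0]

    return predict
```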

Boosting -

Decision trees are powerful but unstable: a small change in the training data can produce a large change in the tree. Boosting is used to reduce the time required for decision tree construction and to improve accuracy. When boosting is used, the training tuples that were misclassified have their weights increased (boosted), and a new tree is formed on the reweighted data.
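The reweighting idea can be sketched as follows, using AdaBoost-style weight updates as an assumed concrete instance; train_weighted_tree and classify are placeholders for any weight-aware tree learner and its prediction function.

```python
import math
from collections import defaultdict

def boosting(train_set, train_weighted_tree, classify, n_rounds=10):
    """train_set: list of (features, label); weights start uniform."""
    n = len(train_set)
    weights = [1.0 / n] * n
    models = []
    for _ in range(n_rounds):
        tree = train_weighted_tree(train_set, weights)
        wrong = [i for i, (x, y) in enumerate(train_set)
                 if classify(tree, x) != y]
        err = sum(weights[i] for i in wrong)
        if err <= 0 or err >= 0.5:
            # Degenerate round: record the tree with a nominal weight and stop.
            models.append((tree, 1.0))
            break
        alpha = 0.5 * math.log((1 - err) / err)
        # Boost the weights of misclassified tuples, shrink the rest.
        for i in range(n):
            factor = math.exp(alpha) if i in wrong else math.exp(-alpha)
            weights[i] *= factor
        total = sum(weights)
        weights = [w / total for w in weights]
        models.append((tree, alpha))

    def predict(instance):
        scores = defaultdict(float)
        for tree, alpha in models:
            scores[classify(tree, instance)] += alpha
        return max(scores, key=scores.get)

    return predict
```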

5. Conclusion

We have described a model of decision tree classification that accommodates data tuples having numerical attributes with uncertainty described by arbitrary PDFs. Classical decision tree building algorithms are modified to build decision trees for classifying such data. We have found empirically that, when suitable PDFs are used, exploiting data uncertainty leads to decision trees with remarkably higher accuracies.

Because of the increased amount of information to be processed, as well as the more complicated entropy computations involved, the performance of decision tree construction can decrease; therefore, a series of pruning techniques is devised to improve tree construction efficiency. The performance of the decision tree can be further improved by using bagging and boosting techniques; this enhancement of the existing system yields a highly accurate decision tree.


