Knowledge Discovery In Databases


02 Nov 2017


Keywords: Knowledge discovery, Clustering methods, Incremental clustering.

1. Introduction

Data mining and knowledge discovery in databases (KDD) have been attracting a great deal of attention from research, industry, science and the media, particularly for real-time applications. The rapid growth of very large databases makes it hard for human analysts to examine the data and extract useful knowledge from it, even with classic statistical (data mining) tools. The eventual purpose of collecting this enormous amount of data is to exploit it for competitive advantage, by uncovering previously unknown patterns that can support decision making.

This can only be achieved by incorporating appropriate prior knowledge and interpretation of the data, rather than simply applying standard data mining techniques. Such interpretation is supported by the tools and techniques of KDD, which are of particular interest to artificial intelligence and machine learning researchers. KDD extracts useful information through two broad kinds of techniques: description and prediction.

Description presents explicit information in a readable form that can be easily understood by the user. Prediction builds on this description of the data to forecast future knowledge. Predicting knowledge from a small database is easy, but for large heterogeneous datasets, clustering is needed to analyse the data. We therefore attempt a multi-method approach [20], combining two different clustering algorithms to improve the clustering technique and, in turn, make knowledge discovery easier.

1.1. Clustering

Clustering can be considered the most important unsupervised learning technique (learning from raw data); it deals with finding structure in a collection of unlabeled data. A cluster in a database can be defined as "a grouping of related items stored together for efficiency of access". Clustering can be made even more effective when combined with semi-supervised [16] or supervised learning.

Clustering algorithms can be broadly classified into four categories:

Flat clustering – This technique clusters the data without any explicit structure connecting the clusters to each other; that is, there is no relation among the clusters. Efficiency is its main strength, but the need for predefined input, the unstructured result and its non-deterministic nature are its limitations.

Hierarchical clustering – When efficiency is not the primary concern, hierarchical clustering may be used. It overcomes drawbacks of flat clustering such as its unstructured output, and it is deterministic in nature. Its incremental character reduces the time complexity on dynamic datasets. However, it does not work well when clustering overlapping data objects.

Hard clustering – In this technique each data object belongs to exactly one cluster. Even though the initial partitioning is random, the resulting clusters are efficient and robust, and overlapping points can be clustered efficiently, which improves performance.

Soft clustering – This is the opposite of hard clustering: a data object may belong to more than one cluster.

Clusters can be validated by their quality and by how well they scale with the dynamic nature of the database.

Quality - Clustering quality depends on internal and external measures. Intra-cluster tightness and inter-cluster separation are internal quality measures, whereas counting the data points that fall into the wrong cluster is an external measure. Most data mining algorithms cannot be applied to complex object databases unless the data is converted into a single flat table, which also affects the quality of the clusters.

Scalability - Another factor affecting clustering is the need to periodically update the data during the maintenance phase, which can be handled with the incremental clustering concept.

The main contributions of this paper are:

A description of the Hard-Flat (Section 3.3) and hierarchical (Section 3.4) clustering methods and their algorithms.

In Section 3.5, an attempt to combine Hard-Flat clustering with the hierarchical incremental clustering [19] method.

In Section 4, experimental results for the incremental clustering method, obtained by combining the k-means algorithm with the incremental algorithm.

2. Related work

In [1], Danyang Cao and Bingru Yang suggested an improved k-medoids clustering algorithm, the CFk-medoids clustering algorithm, to overcome disadvantages such as high time complexity and poor scalability on large datasets. The algorithm is based on the clustering features (CF) of the BIRCH algorithm and also addresses the outlier problem; a CF is a three-dimensional vector summarising information about a cluster of objects. The algorithm improves clustering quality and performance by removing outliers from the CF tree. In [15], Weiguo Sheng and Xiaohui Liu suggest improving k-medoids by combining it with a genetic algorithm, which helps to achieve better performance and more efficient clustering.

In [2], Rehab F. Abdel-Kader attempts to overcome disadvantages of k-means, such as its sensitivity to the selection of initial partitions and its convergence to local optima, by proposing the GAI-PSO k-means clustering algorithm (Genetically Improved Particle Swarm Optimization). The author combines the global optimum searching of evolutionary algorithms with the fast convergence of the k-means algorithm, thereby avoiding the drawbacks of both. The method also combines the standard velocity and population update rules of PSO with the ideas of selection and crossover. Apart from giving better results, it also converges more quickly than ordinary evolutionary methods such as ACO, PSO and SA.

In [3], Liang Sun et al. proposed a k-means based hybrid clustering algorithm that combines the k-means algorithm with the support vector clustering (SVC) algorithm. SVC is not well suited to organising heavily overlapping cluster structures, while k-means performs poorly when the data has a complex structure or an unknown shape. By combining the two algorithms, the weakness of each is compensated by the other, improving the quality of the clusters and making the method more effective.

In [4], XING Xiao-shuai et al. proposed the Immune Programming algorithm, whose global optimum searching property overcomes the purely local search of k-means. This feature is derived from the evolutionary programming concept. In this algorithm, prior knowledge is used to avoid analysing unrelated data, which speeds up the computation and makes it more robust.

In [5], Ming-Yi Shih et al. attempt to cluster data with two different kinds of attributes (numerical and categorical) together. The categorical attributes are first processed to construct relationships among them based on co-occurrence; based on these relationships, the categorical attributes are converted into numeric attributes and then clustered. After the TMCM algorithm, HAC (Hierarchical Agglomerative Clustering) and the k-means algorithm are integrated and used for clustering.

In [6], David Littau and Daniel Boley suggested a scalable method to cluster datasets that are too large to fit in memory. Their PMPDDP algorithm (Piecemeal Principal Direction Divisive Partitioning) breaks the original data up into sections that fit into memory; each section is clustered using PDDP, the sparse product representations are collected, and a final clustering is performed. PMPDDP is more flexible in its memory allocation, which increases clustering accuracy, but it suffers from an increased time cost due to the computation of the intermediate clusters.

In [7], Sudipto Guha et al. deal with multi-feature data analysis using the ROCK (Robust Clustering using links) algorithm, which employs links rather than distances when forming clusters. It produces good-quality clusters and also exhibits good scalability properties. The larger the number of links between a pair of points, the greater the likelihood that they belong to the same cluster; clustering with links thus injects global knowledge into the clustering process and is therefore more robust.

In [8], Fazli Can proposed the concept of incremental clustering for dynamic information processing via the C2ICM (Cover-coefficient based Incremental Clustering for Maintenance) algorithm. The author argues that, for dynamic information, cluster maintenance is needed in addition to cluster formation. A cluster-splitting approach is used to insert or update a new data item in a cluster while reducing the time complexity.

In [9], Mu-Chun Su and Chien-Hsing Chou proposed a modified version of the k-means algorithm in which the distance metric is based on point symmetry within the cluster. After the initial centroids are computed using the Euclidean distance metric, fine tuning is done based on cluster point symmetry. The flexibility achieved by the SBKM (symmetry-based k-means) algorithm comes at the cost of increased computational complexity.

In [10], Dwi H. Widyantoro et al. proposed a novel Incremental Hierarchical Clustering (IHC) algorithm, which mainly deals with the homogeneity and monotonicity properties. The method organises the data as a tree structure and addresses the inefficiency of clustering in a dynamic environment.

In [11], Gabriela Serban and Alina Campan suggested a new Hierarchical Core-Based Incremental Clustering (HCBIC) algorithm. When a new data item enters the database, the cluster objects are repartitioned; the reduced number of iterations lowers the time complexity.

In [12], Sophoin Khy et al. proposed a novelty-based incremental clustering, where the novelty refers to a similarity function and a clustering method (a variant of k-means). They introduced a forgetting factor for incremental clustering, realised in the F2ICM (Forgetting-Factor-based Incremental Clustering Method).

In [17], M. Srinivas and C. Krishna Mohan proposed a Leaders Complete Linkage (LCL) clustering algorithm for hierarchical and incremental clustering and compared their results with the Agglomerative Single-Link Clustering Algorithm (ASLCA). Using LCL, the inter-cluster and intra-cluster distance measures are computed efficiently, which helps to improve the quality of the clusters.

3. Proposed method:

3.1. Hard-Soft clustering:

In hard clustering, the data is partitioned into disjoint sets, so that each data point belongs to exactly one cluster. In soft clustering, each data point has a certain probability of belonging to each of the partitions. Hard clustering is the special case of soft clustering in which these probabilities are either 0 or 1, whereas in soft clustering they can take any non-negative values.
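As a simple illustration (our own, not taken from the paper), the membership of a few data points in two clusters can be written as rows of probabilities: hard clustering restricts every entry to 0 or 1, while soft clustering allows any non-negative values that sum to 1 per point. A minimal sketch in Python:

# Illustrative sketch only: hard vs. soft membership of 3 points in 2 clusters.
# Each row is one data point; each column is its degree of membership in a cluster.
hard_membership = [
    [1, 0],      # point 1 belongs entirely to cluster 1
    [0, 1],      # point 2 belongs entirely to cluster 2
    [1, 0],      # point 3 belongs entirely to cluster 1
]

soft_membership = [
    [0.9, 0.1],  # point 1 is mostly in cluster 1
    [0.3, 0.7],  # point 2 leans towards cluster 2
    [0.5, 0.5],  # point 3 sits on the boundary
]

# Hard clustering is the special case of soft clustering where every row
# contains only 0s and 1s; in both cases each row sums to 1.
for row in hard_membership + soft_membership:
    assert abs(sum(row) - 1.0) < 1e-9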

3.2. Flat clustering:

Flat clustering creates a set of clusters without any explicit structure relating the clusters to each other; any connection between clusters is at most implicit. This method is similar to hard clustering, but each cluster is built over a single type of data attribute alone.

3.3. Hard-Flat Clustering:

Definition:

Given a set of data D = {d1, d2, …, dN}, a user-defined number of clusters K, and a function Ƒ that evaluates the quality of a clustering, find an assignment of objects to clusters

γ : D → {1, 2, …, K}

that minimizes the metric function Ƒ, with none of the clusters left empty. Hard-Flat clustering [21] uses a metric function calculated from the distance between objects.
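One concrete choice of such a metric function (our assumption, consistent with the distance-based description above and with the k-means criterion of Section 3.5.1) is the total distance of every object to the representative µk of its cluster, written here in LaTeX:

\[
  \mathcal{F}(C_1,\dots,C_K) \;=\; \sum_{k=1}^{K} \sum_{d \in C_k} \Omega(d,\mu_k),
  \qquad
  \mu_k \;=\; \frac{1}{|C_k|} \sum_{d \in C_k} d ,
\]
\[
  \text{minimise } \mathcal{F} \text{ over all assignments, subject to } C_k \neq \emptyset \ \text{for every } k .
\]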

Procedure:

(i) Choose a flat dataset (strictly a single data attribute).

(ii) Randomly choose the initial centroids from the dataset.

(iii) Calculate the distance measure between each data point and each centroid.

(iv) Form the clusters by assigning each data point to the centroid with the minimum distance value.

(v) Re-estimate the centroids and repeat steps (iii) and (iv) until the termination condition occurs.

Hard-Flat Algorithm:

Initialize the K cluster representatives µ1, …, µK (for example, as random data points)
While not converged do
    Assign each data point x to the nearest cluster Ch, such that
        h = argmin_k Ω(x, µk)
    Re-estimate the representatives:
        µk = (1 / |Ck|) Σ x∈Ck x

Complexity:

The time complexity for Hard-Flat clustering is O(nkmi), where

n – number of data objects

m – cost of one distance-measure calculation

k – number of centroids

i – number of iterations

For Hard-Flat clustering, k-means is considered the best algorithm because of its simplicity and efficiency. It suits hard clustering because every data object falls strictly within a single cluster, and its use of only a single attribute makes it suitable for flat clustering.
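A minimal Python sketch of this Hard-Flat (k-means style) procedure is shown below. It assumes a one-dimensional numeric attribute and the absolute-difference distance Ω(x, y) = |x − y| of Section 3.5.1; the function and variable names are ours, not the paper's.

# Minimal Hard-Flat clustering sketch for a single numeric attribute (1-D data).
# Assumptions: distance = absolute difference, centroids re-estimated as means.
import random

def hard_flat_cluster(data, k, max_iter=100):
    centroids = random.sample(data, k)           # step (ii): random initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in data:                           # steps (iii)-(iv): nearest-centroid assignment
            j = min(range(k), key=lambda i: abs(x - centroids[i]))
            clusters[j].append(x)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:           # termination: centroids unchanged
            break
        centroids = new_centroids
    return centroids, clusters

# Usage on a small flat (single-attribute) dataset.
data = [2.0, 2.5, 3.1, 10.2, 10.8, 11.0, 25.4, 26.1]
centroids, clusters = hard_flat_cluster(data, k=3)

Each object lands in exactly one cluster (hard) and only one attribute is used (flat); the running cost of one such run matches the O(nkmi) bound above.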

3.4. Hierarchical clustering:

Definition of Hierarchical clustering:

Hierarchical clustering [14] is a greedy algorithm in which merging (agglomerative) or splitting (divisive) is used to cluster the objects. The result is a hierarchy of clusters, a tree-like structure (dendrogram). It needs neither the number of clusters nor initial centroids; all it needs is a measure of similarity between the data objects. The inter-cluster similarity can be calculated with the single-linkage, complete-linkage or group-average methods.

Procedure:

(1) Assign each data object to its own (singleton) cluster.

(2) Calculate the similarity measure (distance) between each cluster and its neighbours, and merge the two most similar clusters.

(3) Compute the distance of the new, merged cluster to each remaining cluster.

(4) Repeat steps (2) and (3) until the similarity measure reaches its maximum.

EfficientHAC algorithm:

Input: dataset D = (d1, d2, …, dN)

for n ← 1 to N do
    for i ← 1 to N do
        C[n][i].sim ← dn · di          // sim: similarity measure
        C[n][i].index ← i
    I[n] ← 1
    P[n] ← priority queue for C[n], ordered on sim
    P[n].Delete(C[n][n])               // no need for self-similarities
A ← []
for K ← 1 to N−1 do
    K1 ← argmax{k : I[k] = 1} P[k].Max().sim
    K2 ← P[K1].Max().index
    A.Append(<K1, K2>)
    I[K2] ← 0
    P[K1] ← []
    for each i with I[i] = 1 and i ≠ K1 do
        P[i].Delete(C[i][K1])
        P[i].Delete(C[i][K2])
        C[i][K1].sim ← Sim(i, K1, K2)
        P[i].Insert(C[i][K1])
        C[K1][i].sim ← Sim(i, K1, K2)
        P[K1].Insert(C[K1][i])
return A

Similarity measures [13]:

Single-linkage (minimum / connectedness):
    d(G, H) = min i∈G, j∈H d(i, j)

Complete-linkage (maximum / farthest):
    d(G, H) = max i∈G, j∈H d(i, j)

Group-average:
    d(G, H) = (1 / (|G|·|H|)) Σ i∈G Σ j∈H d(i, j)

Complexity:

The naive hierarchical agglomerative algorithm [14] has worst-case time complexity O(N3); the priority-queue-based EfficientHAC given above runs in O(N2 log N).

Hierarchical clustering also lends itself to incremental clustering, because it does not require a predetermined number of clusters and new objects can be merged into the existing hierarchy at reduced cost. It is therefore well suited to dynamic, real-time datasets.
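For illustration, a naive Python sketch of bottom-up (agglomerative) clustering with single-linkage on one-dimensional data is given below. It is our own simplification, using the O(N3) pairwise search rather than the priority-queue bookkeeping of EfficientHAC, and it stops at a caller-chosen number of clusters.

# Naive agglomerative hierarchical clustering with single-linkage on 1-D data.
# Illustration only: O(N^3), without the priority queues of EfficientHAC.

def single_link(g, h):
    # Single-linkage: distance between the closest pair of points across clusters.
    return min(abs(i - j) for i in g for j in h)

def agglomerative(data, target_clusters=1):
    clusters = [[x] for x in data]               # step (1): every object is its own cluster
    merges = []                                  # recorded merge sequence (the dendrogram)
    while len(clusters) > target_clusters:
        best = None                              # steps (2)-(3): find the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    return clusters, merges

clusters, merges = agglomerative([1.0, 1.2, 5.0, 5.3, 9.9], target_clusters=2)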

3.5. Incremental Clustering method:

K-means is well suited to huge static datasets, but not to dynamic data. Conversely, hierarchical clustering matches the incremental concept well, but it is not efficient for large databases. We therefore propose an incremental clustering approach that merges the advantages of k-means with the incremental concept.

3.5.1. K-Means algorithm:

K-means is a simple and efficient unsupervised algorithm, even though it lacks global search over the objects and its time complexity grows in a real-time (dynamic) environment. The distance measure is based on the Euclidean distance formula; since this paper focuses on numeric data attributes, the one-dimensional form of the Euclidean distance metric is used:

Ω(x, y) = |x − y|

where Ω(x, y) is the distance between the two points x (a cluster centroid) and y (a data point).

Each centroid is re-calculated as the average of the data points assigned to its cluster:

µk = (1 / |Ψk|) Σ d∈Ψk d

When these centroid calculations and the re-computation of clusters are done in parallel [18], the time complexity can be reduced.

Algorithm:

Input: training set of data D = {d1, d2, …, dN}, number of clusters K

Output: assignment of the data items to each cluster (Ψk) and the final centroids (µk)

(C1, C2, …, CK) ← SelectRandomCentroids({d1, d2, …, dN}, K)
for k ← 1 to K do
    µk ← Ck
while termination condition not reached do
    for k ← 1 to K do
        Ψk ← {}
    for n ← 1 to N do
        j ← argmin_j | µj − dn |
        Ψj ← Ψj ∪ { dn }              // re-computation of clusters
    for k ← 1 to K do
        µk ← (1 / |Ψk|) Σ d∈Ψk d      // re-computation of centroids
return { µ1, µ2, …, µK }

The termination condition can be chosen according to any of the following user perspectives (a small sketch of two such checks follows the list):

A fixed number of iterations has been completed.

The cluster memberships remain the same across iterations.

The centroids remain the same across iterations.

A threshold is set on the minimum Euclidean distance metric.
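As a small illustration (the names are ours, not the paper's), the last two conditions could be checked in Python as follows:

# Two illustrative termination checks for the k-means loop (1-D data).

def centroids_stable(old_centroids, new_centroids, tol=1e-9):
    # Termination: centroids are (numerically) unchanged between iterations.
    return all(abs(o - n) <= tol for o, n in zip(old_centroids, new_centroids))

def within_threshold(data, centroids, threshold):
    # Termination: every point lies within `threshold` of its nearest centroid.
    return all(min(abs(x - mu) for mu in centroids) <= threshold for x in data)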

3.5.2. Incremental Clustering:

We now integrate the incremental concept with the k-means algorithm, to make it suitable for dynamic datasets as well.

Procedure:

The initial set of data is clustered using k-means.

For each change, calculate the distance of the affected data item to each centroid.

If (Incremental)

    Insert the data item into the cluster with the minimum distance value and add it to the dataset separately, so that the static clustering already produced by k-means is not disturbed.

Else if (Decremental)

    Find the cluster with the minimum distance value.

    Delete the data item from that cluster and from the dataset.

Else if (Update)

    Perform the deletion and then the insertion.

Algorithm:

Incremental(d)
    D ← D ∪ { d }
    j ← argmin_k | µk − d |
    Ψj ← Ψj ∪ { d }

Decremental(d)
    D ← D − { d }
    j ← argmin_k | µk − d |
    Ψj ← Ψj − { d }

Update(dold, dnew)
    Decremental(dold);
    Incremental(dnew);

By merging these concepts, we can reduce the time and computation complexity of both clustering methods.
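A compact Python sketch of these three operations on top of an existing k-means result is given below. It is our own illustration: the clusters are kept as plain lists, and it assumes (as the algorithm above does) that a data item always sits in the cluster of its nearest centroid.

# Illustrative incremental/decremental/update operations on a k-means result.
# `centroids` and `clusters` are assumed to come from an initial k-means run
# (for instance the hard_flat_cluster sketch in Section 3.3).

def nearest(centroids, d):
    return min(range(len(centroids)), key=lambda i: abs(centroids[i] - d))

def incremental(data, clusters, centroids, d):
    data.append(d)                              # add the new item to the dataset
    clusters[nearest(centroids, d)].append(d)   # insert it into the nearest cluster

def decremental(data, clusters, centroids, d):
    data.remove(d)                              # remove the item from the dataset
    clusters[nearest(centroids, d)].remove(d)   # delete it from its (nearest) cluster

def update(data, clusters, centroids, d_old, d_new):
    decremental(data, clusters, centroids, d_old)
    incremental(data, clusters, centroids, d_new)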

4. Experimental analysis:

Consider a dataset of roughly a thousand records, processed in two different ways. The first is to run the code for Hard-Flat clustering (k-means) alone and record the execution times. The second is to run our incremental clustering approach, in which the initial clustering is done by k-means and subsequent data is handled by the incremental method. Table 1 compares the execution times of k-means and the incremental method.

Table 1. Analysis for k-means and the incremental method (execution time in milliseconds).

Data increment        k-Means    Incremental    Time savings
0 to 403              435        –              –
403 to 500            522        102            420
500 to 653            762        187            575
653 to 859            993        252            741
Total time savings                              1736

Figure 1. Analysis of incremental clustering.

In Figure 1, the execution time (in milliseconds) is plotted against the number of data objects. Using k-means, the initial 403 data objects are clustered in 435 ms, and as data is added incrementally up to about a thousand objects, the execution time approaches 1000 ms. With our incremental method, the same thousand objects are processed in a much shorter time, cutting roughly three-quarters of the k-means execution time. From this experimental result we can conclude that even for very large data the time is greatly reduced, and the efficiency is therefore increased considerably.
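The numbers in Table 1 are the measurements from our experiment; the Python sketch below (with hypothetical data sizes and helper names of our own) only illustrates how such a comparison between full re-clustering and incremental insertion can be timed.

# Timing sketch: full k-means re-clustering vs. incremental insertion (1-D data).
import random
import time

def kmeans_1d(data, k, iters=20):
    centroids = random.sample(data, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            j = min(range(k), key=lambda i: abs(centroids[i] - x))
            clusters[j].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

random.seed(1)
base = [random.uniform(0, 100) for _ in range(403)]   # hypothetical initial batch
new = [random.uniform(0, 100) for _ in range(97)]     # hypothetical 403 -> 500 increment

t0 = time.perf_counter()
kmeans_1d(base + new, k=5)                            # option 1: re-cluster everything
t_full = time.perf_counter() - t0

centroids, clusters = kmeans_1d(base, k=5)            # option 2: cluster the base once,
t0 = time.perf_counter()
for x in new:                                         # then insert the increment only
    j = min(range(5), key=lambda i: abs(centroids[i] - x))
    clusters[j].append(x)
t_incr = time.perf_counter() - t0

print(f"full re-clustering: {t_full * 1000:.1f} ms, incremental: {t_incr * 1000:.1f} ms")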

5. Conclusion:

Clustering is an important method for retrieving information when working with a large dataset, and a good clustering of the dataset makes the subsequent prediction much easier. Rather than using a single clustering method, this paper therefore attempts an integration of methods. The experimental results show the execution-time difference between plain Hard-Flat clustering and the incremental method, and how well the time complexity is reduced. With this method, the efficiency of knowledge prediction can also be improved.


