Clustering Is a Separation of Data Objects


CHAPTER 1

INTRODUCTION

Clustering is the separation of data objects into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but it achieves simplification. Data modeling places clustering in a historical perspective rooted in mathematics, statistics and related fields. From a machine learning point of view, clusters are hidden patterns, the search for clusters is unsupervised learning, and the result represents a data concept. From a practical point of view, clustering plays a great role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, advertising, medical diagnostics, computational environmental science, and many others (Miroslav Marinov et al., 2004).

Clustering is the subject of active research in numerous fields such as statistics, pattern recognition, biometrics and machine learning. Data mining adds to clustering the complication of very large datasets, which places unique computational requirements on suitable clustering algorithms. A variety of algorithms has been introduced and applied effectively to real data mining problems to meet these requirements.

Cluster analysis groups objects based on the information found in the data describing the objects. The main goal of clustering is that the objects in a group should be similar or related to one another and different from the objects in other groups. The greater the similarity within a cluster and the greater the difference between groups, the better the clustering. In many applications the notion of a cluster is not well defined and the required clusters are not well separated from one another. Nevertheless, most cluster analysis produces a crisp classification of the data into non-overlapping sets or groups. To appreciate the difficulty of deciding what constitutes a cluster, consider figures 1a through 1d, which show twenty points and three different ways in which they can be divided into clusters. If the clusters are allowed to be nested, the most reasonable interpretation of the structure of these points is two clusters, each of which has three subclusters. On the other hand, the apparent division of the two larger clusters into three subclusters may simply be an artifact of human perception.

Finally, it may not be unreasonable to say that the points form four clusters. Thus, once again, the definition of what constitutes a cluster is imprecise, and the best definition depends on the type of data and the desired results.

Figure 1: a) Initial Points

Figure 1: b) Two Clusters

Figure 1: c) Six Clusters

Figure 1: d) Four Clusters

Figure 1: Types of Clusters

SOME WORKING DEFINITIONS OF A CLUSTER

Generally there is no single common definition of a cluster. Several working definitions of a cluster are used in practice, as described in (Richard C. Dubes and Anil K. Jain, 1988).

Definition for Well-Separated Cluster

A well-separated cluster is a set of points such that any point in the cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

Occasionally a threshold is used to specify that all the points in a cluster must be sufficiently close to one another.

Figure 2: Three well-separated clusters of 2 dimensional points.

On the other hand, in many data sets a point on the edge of a cluster may be closer (or more similar) to some objects in another cluster than to objects in its own cluster. As a result, many clustering algorithms use the following criterion.

Center-based Cluster Definition

A cluster is a set of objects in which each object is closer (more similar) to the center of its own cluster than to the center of any other cluster. The center of a cluster is often called the centroid, the average of all the points in the cluster, which is the most representative point of the cluster.

Figure 3: Four center-based clusters of 2 dimensional points

Contiguous Cluster Definition

A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

Figure 4: Eight contiguous clusters of 2 dimensional points

Density-based definition

A density-based cluster is a dense region of points that is separated from other regions of high density by regions of low density. This definition is more often used when the clusters are irregular or intertwined, and when noise and outliers are present. Note that the contiguous definition would find only one cluster in figure 5. Also note that the three curves do not form clusters, since they fade into the noise, as does the bridge between the two small circular clusters.

Figure 5: Six dense clusters of 2 dimensional points

Similarity-based Cluster definition

A cluster is a set of objects that are similar to one another, while objects in other clusters are not similar to them. A variation on this is to define a cluster as a set of points that together create a region with a uniform local property, e.g., density or shape.

Classification of Clustering Algorithms

This section describes the most well-known clustering algorithms. The main reason there are so many clustering methods is that the notion of a "cluster" is not precisely defined (Estivill-Castro, 2000). As a result, many clustering methods have been developed, each using a different induction principle. (Farley and Raftery, 1998) suggest dividing the clustering methods into two main groups: hierarchical and partitioning methods. (Han and Kamber, 2001) suggest categorizing the methods into three additional main categories: density-based methods, model-based clustering and grid-based methods. An alternative categorization based on the induction principle of the various clustering methods is presented in (Estivill-Castro, 2000).

Clustering Algorithms

The most commonly used clustering methods are as follows:

Hierarchical Methods

Agglomerative Algorithms

Divisive Algorithms

Partitioning Methods

Relocation Algorithms

Probabilistic Clustering

K-medoids Methods

K-means Methods

Density-Based Algorithms

Density-Based Connectivity Clustering

Density Functions Clustering

Hierarchical Methods

These methods build the clusters by partitioning the instances in either a top-down or bottom-up approach.

These methods can be subdivided as follows:

Agglomerative hierarchical clustering

In agglomerative clustering, each object initially represents a cluster of its own. The clusters are then successively merged until the desired cluster structure is obtained.

Divisive hierarchical clustering

In divisive clustering, all objects initially belong to one cluster. The cluster is then divided into sub-clusters, which are successively divided into their own sub-clusters. This process continues until the desired cluster structure is obtained.

The result of a hierarchical method is a dendrogram representing the nested grouping of objects and the similarity levels at which groupings change. A clustering of the data objects is obtained by cutting the dendrogram at the desired similarity level.

The merging of clusters is performed according to some similarity measure, chosen to optimize some criterion.

The hierarchical clustering methods can be further divided according to the manner in which the similarity measure is calculated (Jain et al., 1999). They are:

Single-link clustering

Complete-link clustering

Average-link clustering

Single-link clustering

The distance between two clusters is taken to be the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, the similarity between a pair of clusters is taken to be the maximum similarity from any member of one cluster to any member of the other cluster (Sneath and Sokal, 1973).

Complete-link clustering

The distance between two clusters is taken to be the longest distance from any member of one cluster to any member of the other cluster (King, 1967).

Average-link clustering

The distance between two clusters is taken to be the average distance from any member of one cluster to any member of the other cluster. Such clustering algorithms are described in (Ward, 1963) and (Murtagh, 1984).
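To make the three linkage criteria concrete, the following short sketch applies them with SciPy's hierarchical clustering routines; the random two-dimensional data and the choice of four clusters are assumptions made only for illustration.

    # Sketch: single-, complete- and average-link hierarchical clustering with SciPy.
    # The random 2-D data set and the request for 4 clusters are illustrative assumptions.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 2))                          # 40 two-dimensional points

    for method in ("single", "complete", "average"):
        Z = linkage(X, method=method)                     # dendrogram (merge history)
        labels = fcluster(Z, t=4, criterion="maxclust")   # cut the dendrogram into 4 clusters
        print(method, np.bincount(labels)[1:])            # sizes of the resulting clusters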

The disadvantages of the single-link clustering and the average-link clustering can be summarized as follows (Guha et al., 1998):

Single-link clustering has a disadvantage which is known as the "chaining effect": A few points that form a bridge between two clusters may cause the single-link clustering to unify these two clusters into one.

Average-link clustering may cause elongated clusters to split and portions of neighboring elongated clusters to merge.

The complete-link clustering methods usually produce more compact clusters and more useful hierarchies than the single-link clustering methods, yet the single-link methods are more versatile.

Generally, hierarchical methods are characterized with the following strengths:

Versatility

The single-link methods maintain good performance on data sets containing non-isotropic clusters, including well-separated clusters, chain-like clusters and concentric clusters.

Multiple partitions

Hierarchical methods produce not one partition, but multiple nested partitions, which allow different users to choose different partitions, according to the desired similarity level. The hierarchical partition is presented using the dendrogram.

Figure 6: Hierarchical Clustering

Partitioning Methods

Partitioning methods relocate instances by moving them from one cluster to another, starting from an initial partitioning. Such methods require the number of clusters to be set in advance by the user. To achieve global optimality in partition-based clustering, an exhaustive enumeration of all possible partitions would be required. Because this is not feasible, certain greedy heuristics are used in the form of iterative optimization. Specifically, a relocation method iteratively relocates points between the k clusters. The following subsections present various types of partitioning methods. These clustering algorithms were among the first to appear in the data mining community.

The main goal of k-means is to produce k clusters from a set of n objects, so that the squared-error objective function is minimized.

The squared-error objective is E = Σ_{i=1..k} Σ_{p∈Ci} ||p − mi||², where C1, …, Ck are the clusters, p is a point in cluster Ci and mi is the mean of cluster Ci. The mean of a cluster is given by a vector which contains, for each attribute, the mean value of the data objects in the cluster. The input is the number of clusters k, and as output the algorithm returns the centers of the clusters, most of the time without the cluster identities of the individual points. The distance measure generally employed is the Euclidean distance. There are no restrictions on either the optimization criterion or the proximity index; they can be specified according to the application or the user's preference.

The algorithm is as follows:

Select k objects as initial centers;

Allocate each data object to the closest center;

Recalculate the centers of each cluster;

Repeat steps 2 and 3 until centers do not change;

The algorithm is relatively scalable, since its complexity is O(nkI), where n is the number of objects and I denotes the number of iterations, and usually k, I << n.
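As a usage illustration (not part of the original formulation), the algorithm above corresponds closely to scikit-learn's KMeans; the synthetic blob data and the parameter values below are assumptions chosen only for the example.

    # Sketch: k-means on a small synthetic data set using scikit-learn.
    # The blob data, k = 3, n_init and random_state are illustrative assumptions.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
    km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(km.cluster_centers_)   # the k cluster centers (means)
    print(km.inertia_)           # the squared-error objective E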

PAM is an extension of k-means, intended to handle outliers efficiently. Instead of cluster centers, it represents each cluster by its medoid, the most centrally located object inside the cluster. As a consequence, medoids are less influenced by extreme values: the mean of a number of objects has to "follow" these values, while a medoid does not. The algorithm chooses k medoids initially and tries to place other objects in the cluster whose medoid is closest to them, while it swaps medoids with non-medoids as long as the quality of the result improves. Quality is again measured using the squared error between the objects in a cluster and its medoid. The computational complexity of PAM is O(I·k·(n − k)²), with I being the number of iterations, making it very costly for large n and k values.

A solution to this is the CLARA algorithm by (Kaufman and Rousseeuw, 1990). This approach works on several samples of size s drawn from the n tuples in the database, applying PAM to each one of them. The output depends on the s samples and is the "best" result given by the application of PAM on these samples. It has been shown that CLARA works well with 5 samples of size 40 + k (Kaufman and Rousseeuw, 1990), and its computational complexity becomes O(k·s² + k(n − k)). Note that there is a quality issue when using sampling techniques in clustering: the result may not represent the initial data set, but rather a locally optimal solution. In CLARA, for example, if the "true" medoids of the initial data are not contained in the sample, then the result is guaranteed not to be the best.

The CLARANS approach works as follows:

Randomly choose k medoids;

Randomly consider one of the medoids to be swapped with a non-medoid;

If the cost of the new configuration is lower, repeat step 2 with new solution;

If the cost is higher, repeat step 2 with different non-medoid object, unless a limit has been reached (the maximum value between 250 and k(n-1);

Compare the solutions so far, and keep the best;

Return to step 1, unless a limit has been reached (set to the value of 2);

CLARANS compares an object with every other object, in the worst case, for every one of the k medoids. Thus, its computational complexity is roughly O(k·n²), which does not make it suitable for large data sets.

Well separated Clusters

Clusters of different sizes close to each other

Arbitrary-Shaped Clusters

Figure 7: Three applications of the k-means algorithm

Figure 7 presents the application of k-means to three kinds of data sets. The algorithm performs well on appropriately separated, spherically shaped groups of data (Figure 7(a)). If the two groups are close to each other, some of the objects of one group may end up in the other cluster, especially if one of the initial cluster representatives is close to the cluster boundaries (Figure 7(b)). Finally, k-means does not perform well on non-convex-shaped clusters (Figure 7(c)), due to the usage of the Euclidean distance. As already mentioned, PAM handles outliers better, since medoids are less influenced by extreme values than means, something that k-means fails to do in an acceptable way.

Graph-Theoretic Clustering

Graph-theoretic methods are methods that produce clusters by means of graphs. The edges of the graph connect the instances, which are represented as nodes. A well-known graph-theoretic algorithm is based on the Minimal Spanning Tree (MST) (Zahn, 1971): clusters are obtained by deleting inconsistent edges of the MST, i.e., edges whose weight is considerably larger than the average length of nearby edges. An additional graph-theoretic approach constructs graphs based on limited neighborhood sets (Urquhart, 1982).

Single-link clusters are subgraphs of the MST of the data instances. Each subgraph is a connected component, that is to say a set of instances in which each instance is connected to at least one other member of the set, so that the set is maximal with respect to this property. Hence the subgraphs are produced according to some similarity threshold.

Complete-link clusters are maximal complete subgraphs, formed using a similarity threshold. A maximal complete subgraph is a subgraph such that each node is connected to every other node in the subgraph and the set is maximal with respect to this property.
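A minimal sketch of MST-based clustering in the spirit of Zahn's approach is given below. For simplicity it treats the k − 1 longest MST edges as the inconsistent ones, which is a simplifying assumption rather than Zahn's local inconsistency test; the two-blob data set is likewise assumed for illustration.

    # Sketch: clusters as connected components of the MST after deleting long edges.
    # Deleting the (k-1) longest edges is a simplification of Zahn's inconsistency criterion.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.3, (30, 2)),               # two well-separated blobs
                   rng.normal(5, 0.3, (30, 2))])

    D = squareform(pdist(X))                                   # full pairwise distance matrix
    mst = minimum_spanning_tree(D).toarray()                   # weighted MST as a dense matrix

    k = 2                                                      # desired number of clusters
    edges = np.argwhere(mst > 0)
    longest = edges[np.argsort(mst[mst > 0])[-(k - 1):]]       # positions of the longest MST edges
    for i, j in longest:
        mst[i, j] = 0                                          # delete the "inconsistent" edges

    n_clusters, labels = connected_components(mst, directed=False)
    print(n_clusters, np.bincount(labels))                     # number of clusters and their sizes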

Density-based Methods

A density-based method assumes that the points belonging to each cluster are drawn from a specific probability distribution (Banfield and Raftery, 1993). The overall distribution of the data is assumed to be a mixture of several distributions. The aim of these methods is to identify the clusters and their distribution parameters. These methods are designed for discovering clusters of arbitrary shape which are not necessarily convex; that is, x1, x2 ∈ C does not necessarily imply that αx1 + (1 − α)x2 ∈ C for every α in [0, 1].

The idea is to continue growing the given cluster as long as the density in the neighborhood exceeds some threshold; that is, the neighborhood of a given radius must contain at least a minimum number of objects. When each cluster is characterized by a local mode or maximum of the density function, these methods are called mode-seeking. A great deal of work in this field has been based on the underlying assumption that the component densities are multivariate Gaussian or multinomial. An acceptable solution in this case is to use the maximum likelihood principle. According to this principle, one should choose the clustering structure and parameters such that the probability of the data being generated by such a clustering structure and parameters is maximized. The expectation maximization (EM) algorithm is described by (Dempster et al., 1977). It is a general-purpose maximum likelihood algorithm for missing-data problems, applied here to the problem of parameter estimation. The algorithm starts with an initial estimate of the parameter vector and then alternates between two steps (Farley and Raftery, 1998): an "E-step", in which the conditional expectation of the complete-data likelihood given the observed data and the current parameter estimates is computed, and an "M-step", in which the parameters that maximize the expected likelihood from the E-step are determined. The algorithm is known to converge to a local maximum of the observed-data likelihood.
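The E-step/M-step alternation can be illustrated with scikit-learn's GaussianMixture, which implements EM for a mixture of Gaussians; the synthetic data and the choice of two components are assumptions made only for this example.

    # Sketch: EM-based clustering with a Gaussian mixture model (scikit-learn).
    # The two-blob data set and n_components = 2 are illustrative assumptions.
    from sklearn.mixture import GaussianMixture
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=2, random_state=0)
    gm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
    print(gm.means_)                 # estimated component means (cluster centers)
    print(gm.predict_proba(X)[:5])   # soft cluster memberships produced by the E-step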

The K-means algorithm may be viewed as a degenerate EM algorithm with hard (all-or-nothing) cluster assignments: assigning instances to clusters in K-means corresponds to the E-step, and computing new cluster centers corresponds to the M-step. DBSCAN is an algorithm that discovers clusters of arbitrary shape and is efficient for large spatial databases. It searches for clusters by examining the neighborhood of each object in the database and checking whether it contains more than the minimum number of objects (Ester et al., 1996).
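A short usage sketch of the DBSCAN idea, using scikit-learn's implementation, is shown below; the eps (neighborhood radius) and min_samples values, and the two-moons data, are assumptions that would need tuning for real data.

    # Sketch: density-based clustering with DBSCAN.
    # eps (neighborhood radius) and min_samples (minimum number of objects) are assumed values.
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # two intertwined, non-convex shapes
    db = DBSCAN(eps=0.2, min_samples=5).fit(X)
    print(set(db.labels_))   # cluster labels; -1 marks noise/outlier points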

AUTOCLASS is a widely used algorithm that covers a variety of distributions, including Gaussian, Bernoulli, Poisson, and log-normal distributions (Cheeseman and Stutz, 1996). Other well-known density-based methods include SNOB (Wallace and Dowe, 1994) and MCLUST (Farley and Raftery, 1998).

Density-based clustering may also employ nonparametric methods, such as searching for bins with large counts in a multidimensional histogram of the input instance space (Jain et al., 1999).

WORKING OF BASIC CLUSTERING ALGORITHM

The K-means clustering technique is simple; we begin with a description of the basic algorithm.

Basic K-means Algorithm is used for finding K clusters.

1. Select K points as the initial centroids.

2. Assign all points to the closest centroid.

3. Recompute the centroid of each cluster.

4. Repeat steps 2 and 3 until the centroids don’t change.
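A minimal NumPy sketch of these four steps follows; the random initialization, the exact "centroids unchanged" stopping test and the absence of empty-cluster handling are simplifying assumptions.

    # Sketch: the basic K-means loop of steps 1-4, written directly in NumPy.
    # Random initialization and no handling of empty clusters are simplifying assumptions.
    import numpy as np

    def basic_kmeans(X, k, rng=np.random.default_rng(0), max_iter=100):
        centroids = X[rng.choice(len(X), size=k, replace=False)]       # step 1: select K points
        for _ in range(max_iter):
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)                              # step 2: assign to closest centroid
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # step 3
            if np.allclose(new_centroids, centroids):                  # step 4: stop when unchanged
                break
            centroids = new_centroids
        return labels, centroids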

In the absence of numerical problems, this procedure always converges to a solution, although the solution is typically a local minimum. The following figures give an example of this. Figure 8a shows the case where the cluster centers coincide with the circle centers; this is a global minimum. Figure 8b shows a local minimum.

Figure 8a: A globally minimal clustering solution

Figure 8b: A locally minimal clustering solution

Choosing initial centroids

The key step of the basic K-means procedure is choosing proper initial centroids. It is simple and efficient to choose initial centroids randomly, but the results are often poor. One remedy that has been studied is to perform multiple runs, each with a different set of randomly chosen initial centroids, but this may still not work, depending on the data set and the number of clusters sought. We start with a very simple example of three clusters and 16 points.

Figure 9a indicates the "natural" clustering that results when the initial centroids are "well" distributed. Figure 9b indicates a "less natural" clustering that happens when the initial centroids are poorly chosen.

Figure 9a: Good starting centroids and a "natural" clustering.

Figure 9b: Bad starting centroids and a "less natural" clustering.

The artificial data set shown in figure 10a was also constructed as another illustration of what can go wrong. The figure consists of 10 pairs of circular clusters, where the clusters of each pair are close to each other but relatively far from the other clusters. The probability that an initial centroid will come from any given cluster is 0.10, but the probability that each cluster will have exactly one initial centroid is much lower (about 10!/10^10 ≈ 0.00036, assuming the 10 centroids are drawn independently and uniformly).

There is no problem as long as two initial centroids fall anywhere within a pair of clusters, since the centroids will redistribute themselves, one to each cluster, and so achieve a globally minimal error. However, it is likely that some pair of clusters will receive only one initial centroid. In that case, because the pairs of clusters are far apart, the K-means algorithm will not redistribute the centroids between pairs of clusters, and thus only a local minimum will be achieved. When starting with the uneven distribution of initial centroids shown in figure 10b, we get the non-optimal clustering shown in figure 10c, where different fill patterns indicate different clusters. One of the clusters is split into two clusters, while two clusters are joined into a single cluster.

Figure 10a: Data distributed in 10 circular regions

Figure 10b: Initial Centroids

Figure 10c: K-means clustering result

Because random sampling may not cover all clusters, other techniques are often used for finding the initial centroids. For example, initial centroids are often chosen from dense regions, and so that they are well separated, i.e., so that no two centroids are chosen from the same cluster.
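One widely used seeding heuristic of this kind is the k-means++ rule, sketched below: the first centroid is chosen at random and each further centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen, which tends to spread the centroids across separate dense regions. This is offered as one example, not as the specific technique the text refers to.

    # Sketch: k-means++-style seeding, which tends to pick well-separated initial centroids.
    import numpy as np

    def kmeanspp_init(X, k, rng=np.random.default_rng(0)):
        centroids = [X[rng.integers(len(X))]]                   # first centroid: uniform at random
        for _ in range(k - 1):
            d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
            probs = d2 / d2.sum()                               # far-away points are more likely
            centroids.append(X[rng.choice(len(X), p=probs)])
        return np.array(centroids)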

HOW THE CLUSTERING METHODS OPTIMIZE INTO VARIOUS TECHNIQUES

These methods try to optimize the fit between the given data and some mathematical model. Unlike conventional clustering, which only identifies groups of objects, model-based clustering methods also find characteristic descriptions for each group, where each group represents a concept or class. The most frequently used induction methods are decision trees and neural networks.

Decision Trees

Here the data is represented by a hierarchical tree, where each leaf refers to a concept and contains a probabilistic description of that concept. Several algorithms have been developed to produce classification trees for representing unlabelled data. The most well-known algorithms are:

COBWEB: this algorithm assumes that all attributes are independent. Its aim is to achieve high predictability of nominal variable values, given a cluster. The algorithm is not suitable for clustering large databases (Fisher, 1987).

CLASSIT, an extension of COBWEB for continuous-valued data, unfortunately has similar problems as the COBWEB algorithm.

Neural Networks

In this approach each cluster is represented by a neuron or "prototype". The input data are also represented by neurons, which are connected to the prototype neurons. Each connection has a weight, which is learned adaptively during the learning process. The self-organizing map (SOM) is a popular neural network algorithm of this kind. It constructs a single-layered network, and the learning process takes place in a "winner-takes-all" fashion: the prototype neurons compete for the current instance, and the winner and its neighbors learn by having their weights adjusted.

The SOM algorithm has been used successfully for vector quantization and speech recognition. It is useful for visualizing high-dimensional data in 2D or 3D space. However, it is sensitive to the initial selection of the weight vectors, as well as to its different parameters, such as the learning rate and neighborhood radius.

Fuzzy Clustering

Conventional clustering approaches generate partitions in which each instance belongs to one and only one cluster; consequently, the clusters in a hard clustering are disjoint. Fuzzy clustering (Hoppner, 2005) extends this notion and suggests a soft clustering scheme. In this case, each pattern is associated with every cluster through some sort of membership function; that is to say, each cluster is a fuzzy set of all the patterns. Larger membership values indicate higher confidence in the assignment of the pattern to the cluster. A hard clustering can be obtained from the fuzzy partition by thresholding the membership values.

The fuzzy c-means (FCM) algorithm is one of the most popular schemes; it is better than the hard K-means algorithm at avoiding local minima, although FCM can still converge to local minima of the squared-error criterion. A generalization of the FCM algorithm has been proposed through a family of objective functions. A fuzzy c-shell algorithm and an adaptive variant for detecting circular and elliptical boundaries have also been presented.
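A compact sketch of the fuzzy c-means iteration is given below; the fuzzifier m = 2, the tolerance and the random initial memberships are conventional assumptions rather than values prescribed by the text.

    # Sketch: fuzzy c-means. Every point holds a membership in every cluster; memberships
    # and cluster centers are updated alternately. m is the fuzzifier (m = 2 is a common default).
    import numpy as np

    def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, rng=np.random.default_rng(0)):
        U = rng.random((len(X), c))
        U /= U.sum(axis=1, keepdims=True)                 # memberships of each point sum to 1
        for _ in range(max_iter):
            Um = U ** m
            centers = (Um.T @ X) / Um.sum(axis=0)[:, None]                   # weighted centers
            dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            U_new = 1.0 / np.sum((dist[:, :, None] / dist[:, None, :]) ** (2 / (m - 1)), axis=2)
            if np.linalg.norm(U_new - U) < tol:           # stop when memberships barely change
                break
            U = U_new
        return U, centers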

The ROCK Algorithm

ROCK (RObust Clustering using linKs) (Sudipto Guha, 1999) is a hierarchical algorithm for categorical data. Guha et al. propose a novel approach based on a new concept called links between data objects. This idea helps to overcome problems that arise from the use of Euclidean metrics over vectors, where each vector represents a tuple in the database whose entries are identifiers of the categorical values. More precisely, ROCK defines the following:

two data objects pi and pj are called neighbors if their similarity exceeds a certain threshold θ given by the user, i.e. sim(pi, pj) ≥ θ;

for two data objects pi and pj, link(pi, pj) is defined as the number of common neighbors of the two objects, i.e., the number of objects to which both pi and pj are similar;

the interconnectivity between two clusters Ci and Cj is given by the number of cross-links between them, which is equal to link[Ci, Cj] = Σ_{pq∈Ci, pr∈Cj} link(pq, pr);

the expected number of links within a cluster Ci of size ni is ni^(1+2f(θ)); in all the experiments presented, f(θ) = (1 − θ)/(1 + θ).

In brief, ROCK measures the similarity of two clusters by comparing their aggregate interconnectivity against a user-specified static interconnectivity model. The objective of ROCK is then the maximization of the criterion function El = Σ_{i=1..k} ni · [ Σ_{pq,pr∈Ci} link(pq, pr) / ni^(1+2f(θ)) ].

Figure 11: Overview of ROCK [GRS99] (data → draw random samples → cluster samples with links → label data on disk)

A random sample is drawn and a (hierarchical) clustering algorithm is applied to merge clusters. Hence a measure is needed to identify the clusters that should be merged at every step. This measure between two clusters Ci and Cj is called the goodness measure and is given by g(Ci, Cj) = link[Ci, Cj] / [ (ni + nj)^(1+2f(θ)) − ni^(1+2f(θ)) − nj^(1+2f(θ)) ], where link[Ci, Cj] is the number of cross-links between the clusters, as defined above.

The pair of clusters for which the above goodness measure is maximum is the best pair of clusters to be merged.
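The neighbor and link computations at the heart of ROCK can be sketched in a few lines. Jaccard similarity over sets of categorical values, the tiny data set and θ = 0.5 are assumptions for illustration, and only the link matrix is computed, not the full hierarchical merging.

    # Sketch: ROCK-style neighbors and links. Two records are neighbors if their Jaccard
    # similarity is at least theta; link(p, q) is the number of neighbors they have in common.
    import numpy as np

    records = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"x", "y", "z"}, {"x", "y", "w"}]
    theta = 0.5                                            # assumed similarity threshold

    def jaccard(s, t):
        return len(s & t) / len(s | t)

    n = len(records)
    neighbors = np.array([[1 if jaccard(records[i], records[j]) >= theta else 0
                           for j in range(n)] for i in range(n)])
    links = neighbors @ neighbors      # links[i, j] = number of common neighbors of records i and j
    print(links)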

Shared Nearest Neighbor Clustering

1) First the k-nearest neighbors of all the points are found. In graph terms, this can be regarded as breaking all but the k strongest links from each point to other points in the proximity graph.

2) All pairs of points are then compared, and two points are placed in the same cluster if a) they share more than kt ≤ k neighbors, and b) each of the two points being compared is among the k-nearest neighbors of the other.

This approach can handle clusters of dissimilar densities because the nearest-neighbor process is self-scaling. The approach is also transitive: if point p shares lots of near neighbors with point q, which in turn shares lots of near neighbors with point r, then points p, q and r all belong to the same cluster. This transitivity allows the technique to handle clusters of different sizes and shapes. However, transitivity can also join clusters that shouldn't be joined, depending on the k and kt parameters. Large values for both of these parameters tend to prevent these spurious connections, but also tend to favor the formation of globular clusters.
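A minimal sketch of this shared-nearest-neighbor rule (in the style of Jarvis-Patrick) is given below, with k and kt as the two parameters discussed above; the two-blob data and the particular parameter values are assumptions.

    # Sketch: shared nearest neighbor clustering. Two points are linked if each is among the
    # other's k nearest neighbors and they share more than kt of those neighbors; clusters
    # are then the connected components of the resulting graph.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from scipy.sparse.csgraph import connected_components

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(6, 0.5, (30, 2))])
    k, kt = 10, 4                                                   # illustrative parameter choices

    knn = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X, return_distance=False)[:, 1:]
    neighbor_sets = [set(row) for row in knn]                       # drop each point's own index

    n = len(X)
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in neighbor_sets[i]:
            mutual = i in neighbor_sets[j]                          # condition b): mutual k-NN
            shared = len(neighbor_sets[i] & neighbor_sets[j])       # condition a): shared neighbors
            if mutual and shared > kt:
                adj[i, j] = adj[j, i] = 1

    n_clusters, labels = connected_components(adj, directed=False)
    print(n_clusters, np.bincount(labels))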

Genetic Algorithm using Clustering

A genetic algorithm (GA), proposed by Holland, is a search heuristic that mimics the process of natural evolution and is used for optimization and search problems. GAs belong to the class of evolutionary algorithms: they use operations from evolutionary algorithms and extend them by encoding candidate solutions as strings, called chromosomes.

A GA has the following phases (a clustering-oriented sketch follows the list):

Initialization: Generate an initial population of K candidates and compute fitness.

Selection: For each generation, select µK candidates based on fitness to serve as parents.

Crossover: Pair parents randomly and perform crossover to generate offspring.

Mutation: Mutate offspring.

Replacement: Replace parents by offspring and start over with selection.
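The sketch below shows one way these phases might be applied to clustering: each chromosome encodes k candidate centroids and fitness is the negative squared error. This encoding, together with the population size, generation count and mutation scale, is an assumption made for illustration rather than a standard prescribed by the text.

    # Sketch: a genetic algorithm for clustering. Chromosome = k centroids (flattened vector),
    # fitness = negative sum of squared distances of points to their nearest centroid.
    import numpy as np

    def fitness(chrom, X, k):
        centroids = chrom.reshape(k, X.shape[1])
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return -np.sum(d.min(axis=1) ** 2)                 # higher (less negative) is better

    def ga_cluster(X, k, pop_size=30, generations=50, rng=np.random.default_rng(0)):
        dim = k * X.shape[1]
        pop = X[rng.choice(len(X), (pop_size, k))].reshape(pop_size, dim)    # initialization
        for _ in range(generations):
            fit = np.array([fitness(c, X, k) for c in pop])
            parents = pop[np.argsort(fit)[-pop_size // 2:]]                  # selection: keep best half
            children = []
            while len(children) < pop_size:
                a, b = parents[rng.choice(len(parents), 2, replace=False)]
                cut = rng.integers(1, dim)                                   # one-point crossover
                child = np.concatenate([a[:cut], b[cut:]])
                child += rng.normal(0, 0.05, dim)                            # mutation
                children.append(child)
            pop = np.array(children)                                         # replacement
        return pop[np.argmax([fitness(c, X, k) for c in pop])].reshape(k, X.shape[1])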

Other Techniques in Clustering

When performing clustering on categorical data, the techniques used are typically based on co-occurrences of the data objects or on the number of neighbors they have, and at the same time they do not deal with mixed attribute types. STIRR adopts theory from the dynamical systems area and spectral graph theory to give a solution. CACTUS employs techniques similar to the ones used in frequent item-set discovery and summarizes information in a similar way to BIRCH.

It is our belief that there exist methods, not yet applied to categorical attributes, which may lead to more succinct results (recall that STIRR needs a painful post-processing step to describe its results). For instance, there are techniques employed by the machine learning community to cluster documents according to the terms they contain. It would be of interest to examine the properties of these methods and investigate whether they can be effectively applied to categorical as well as mixed attribute types.

Clustering algorithms also play an important role in the medical field and in gene expression datasets.

CLUSTERING ALGORITHM FOR GENE EXPRESSION AND ITS IMPLEMENTATION

DNA microarray technology is a primary tool for studying gene expression. The accumulation of data sets from this technology, which measures the relative abundance of mRNA for thousands of genes across tens, hundreds or thousands of samples, has underscored the need for quantitative analytical tools to examine such data. Owing to the large number of genes and the complexity of gene regulation, clustering is a useful exploratory method for analyzing these data. Clustering divides the data into a small number of relatively homogeneous groups or clusters. There are at least two ways to apply cluster analysis to microarray data. One way is to cluster arrays, i.e. samples from different tissues or from cells at different time points of a biological process or treatment; global expression profiles of various tissues or cellular states are classified using this type of clustering. The second way is to cluster genes according to their expression levels across different conditions; this approach groups co-expressed genes and can reveal co-regulated genes or genes that may be involved in the same pathways.

Various clustering algorithms have been introduced for gene expression data. For instance, (Eisen, Spellman, Brown and Botstein 1998) introduced a hierarchical average-linkage clustering algorithm to recognize groups of co-regulated yeast genes. (Tavazoie et al. 1999) reported success with the k-means algorithm, which reduces the dispersion within clusters by iterative reallocation of cluster members.

(Tamayo et al. 1999) showed that clusters in yeast cell cycle and human hematopoietic differentiation data sets can be identified by self-organizing maps (SOM). Some algorithms require that every gene in the dataset belong to one and only one cluster, while others may generate fuzzy clusters or leave some genes unclustered. The first type is the most frequently used in the literature, and we restrict our attention to it. The hardest problem in comparing clustering algorithms is finding algorithm-independent measures of the quality of each cluster. Hence several indices, such as homogeneity, silhouette width, redundancy scores and WADP, are introduced to assess the quality of various algorithms on the NIA mouse 15K microarray data. These indices use only the information in the data themselves and evaluate the clusters without any external knowledge about the microarray data. We begin with a discussion of the different types of algorithms, followed by an explanation of the microarray data pre-processing. We then elaborate on the definitions of the indices and the measurement of algorithm performance using these indices, and examine the differences between the clusters produced by the various methods and their possible correlation with biological knowledge.

K-means

K-means is a partitioning algorithm in which the objects are classified into one of k groups, where k is chosen a priori. Cluster membership is determined by calculating the centroid of each group and assigning each object to the group with the closest centroid. This approach reduces the overall within-cluster dispersion by iterative reallocation of cluster members (Hartigan and Wong 1979).

In the general case, a k-partitioning algorithm takes as input a set S of objects and an integer k, and outputs a partition of S into subsets S1, …, Sk. It uses the sum of squares as the optimization criterion. Let xr(i) be the r-th element of Si and d(xr(i), xs(i)) the distance between xr(i) and xs(i). In particular, k-means works by calculating the centroid of each cluster Si, denoted x̄(i), and using the sum-of-squares cost function cost(Si) = Σr d(xr(i), x̄(i))². The goal of the algorithm is to minimize the total cost Σ_{i=1..k} cost(Si).

The implementation of the k-means algorithm used in this study was the one in S-plus (MathSoft, Inc.), which initializes the cluster centroids with hierarchical clustering by default, and thus gives reproducible outcomes. The output of the k-means algorithm is the given number of k clusters with their respective centroids.

PAM (Partitioning Around Medoids)

PAM is a k-partitioning approach that can be used to cluster types of data for which the mean of objects is not defined (Kaufman and Rousseeuw 1990). The algorithm finds the representative object (i.e., the medoid, which is the multidimensional version of the median) of each Si, denoted m(i), uses the cost function cost(Si) = Σr d(xr(i), m(i)), and tries to minimize the total cost Σ_{i=1..k} cost(Si).

The implementation of PAM in S-plus then finds a local minimum for the objective function, that is, a solution such that there is no single swap of an object with a medoid that will decrease the total cost.

Hierarchical Clustering

Partitioning clustering formulates an initial number of groups and iteratively reallocates objects among groups until convergence. In contrast, hierarchical clustering merges or divides existing groups, generating a hierarchical structure that reflects the order in which groups are merged or divided. The agglomerative technique generates the hierarchy by merging: the objects initially belong to a list of singleton sets; a cost function is then used to identify the pair of sets from the list that is "cheapest" to merge. Once merged, the two sets are eliminated from the list of sets and replaced with their union. This process iterates until all objects are in a single group.

SOM (Self-organization map)

SOM uses a competition and cooperation mechanism to achieve unsupervised learning. In the traditional SOM, a set of nodes is arranged in a geometric pattern, typically a two-dimensional lattice. Each node is associated with a weight vector of the same dimension as the input space. The main aim of SOM is to find a meaningful mapping from the high-dimensional input space to the 2-D representation of the nodes. SOM can be used for clustering by treating the objects in the input space that are mapped to the same node as grouped into a cluster. During training, each object in the input is presented to the map, its best-matching node is found, and the weights of that node and its neighbors are moved towards the object.
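A minimal SOM sketch along these lines is shown below; the grid size, learning rate, neighborhood radius and number of epochs are illustrative assumptions, and the neighborhood is kept fixed rather than shrunk over time as in most practical SOM implementations.

    # Sketch: a minimal self-organizing map on a small 2-D grid of nodes.
    # Grid size, learning rate, neighborhood radius and epochs are illustrative assumptions.
    import numpy as np

    def train_som(X, rows=4, cols=4, epochs=20, lr=0.5, radius=1.5, rng=np.random.default_rng(0)):
        grid = np.array([(r, c) for r in range(rows) for c in range(cols)])   # node coordinates
        W = rng.random((rows * cols, X.shape[1]))                             # node weight vectors
        for _ in range(epochs):
            for x in X[rng.permutation(len(X))]:
                bmu = np.argmin(np.linalg.norm(W - x, axis=1))                # best-matching node
                g = np.exp(-np.sum((grid - grid[bmu]) ** 2, axis=1) / (2 * radius ** 2))
                W += lr * g[:, None] * (x - W)                                # winner and neighbors move toward x
        return W                           # cluster of an object = index of its best-matching node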

CLUSTERING PREREQUISITES IN GENE EXPRESSION

Clustering GE data usually involves the steps below (A. K. Jain et al. 1999):

(1) Pattern representation: this concerns the representation of the data matrix for clustering, i.e. the number, type, dimension and scale of the GE profiles available. A number of these are set during the execution of the experiment; other features, however, are controllable, such as the scaling of measurements, imputation, normalisation techniques, representations of up/down-regulation, etc. An optional step of feature selection or feature extraction can be carried out.

These are two distinct procedures: the former refers to selecting the subset of the original features that would be most effective to use in the clustering procedure, while the latter uses transformations of the input features to produce new, more salient features that may be more informative in the clustering procedure, e.g. Principal Component Analysis.

(2) Definition of a pattern proximity measure: typically a distance measured between pairs of genes. Alternatively, conceptual measures can be used to characterize the similarity among a group of gene profiles, e.g. the Mean Residue Score of Cheng and Church.

(3) Clustering the data: a clustering algorithm is used to find structures (clusterings) in the dataset. Clustering methods can be broadly categorized according to the classification of (A. K. Jain et al. 1999).

(4) Data abstraction: the representation of the structures found in the dataset. In GE data this is usually human-oriented, so the data abstraction must be easy to interpret. It is usually a compact description of each cluster, through a cluster prototype or a representative selection of patterns within the cluster, such as the cluster centroid.

(5) Assessment of output: validation of clustering results is essential to cluster analysis of GE data. A clustering output is valid if it cannot reasonably have been achieved by chance or as an artifact of the clustering algorithm. Validation is achieved by careful application of statistical methods and the testing of hypotheses. These measures can be categorized as follows (an internal-validation sketch is given after the list):

(i) Internal validation,

(ii) External validation and

(iii) Relative validation.
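As an example of internal validation, the silhouette width mentioned earlier can be computed from the data and the cluster labels alone; the sketch below uses scikit-learn, and the synthetic data and the range of k values are assumptions for illustration.

    # Sketch: internal validation of clusterings with the silhouette width,
    # which uses only the data and the labels (no external knowledge).
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
    for k in (2, 3, 4, 5, 6):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))   # higher values indicate better-separated clusters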

REQUIREMENTS FOR CLUSTERING ANALYSIS

Typical Problems and Desired Characteristics

The desired characteristics of a clustering algorithm depend on the particular problem under consideration.

Scalability

Clustering techniques for large sets of data must be scalable, both in terms of speed and space. It is not unusual for a database to contain millions of records, and thus any clustering algorithm used should have linear or near-linear time complexity to handle such large data sets. (Even algorithms with complexity O(m²) are not practical for large data sets.) Some clustering techniques use statistical sampling. Nonetheless, there are cases, e.g., situations where relatively rare points have a dramatic effect on the final clustering, where sampling is insufficient.

Furthermore, clustering techniques for databases cannot assume that all the data will fit in main memory or that data elements can be randomly accessed; algorithms that rely on such assumptions are infeasible for large data sets. Accessing data points sequentially and not being dependent on having all the data in main memory at once are important characteristics for scalability.

Independence of the order of input

Some clustering algorithms are dependent on the order of the input, i.e., if the order in which the data points are processed changes, then the resulting clusters may change. This is unappealing since it calls into question the validity of the clusters that have been discovered; they may just represent local minima or artifacts of the algorithm.

Effective means of detecting and dealing with noise or outlying points

A point which is noise or is simply an atypical point (outlier) can often distort a clustering algorithm. By applying tests that determine if a particular point really belongs to a given cluster, some algorithms can detect noise and outliers and delete them or otherwise eliminate their negative effects. This processing can occur either while the clustering process is taking place or as a post-processing step.

However, in some instances, points cannot be discarded and must be clustered as well as possible. In such cases, it is important to make sure that these points do not distort the clustering process for the majority of the points.

Effective means of evaluating the validity of clusters that are produced.

It is common for clustering algorithms to produce clusters that are not "good" clusters when evaluated later.

Easy interpretability of results

Many clustering methods produce cluster descriptions that are just lists of the points belonging to each cluster. Such results are often hard to interpret. A description of a cluster as a region may be much more understandable than a list of points. This may take the form of a hyper-rectangle or a center point with a radius. Also, data clustering is sometimes preceded by a transformation of the original data space – often into a space with a reduced number of dimensions. While this can be helpful for finding clusters, it can make the results very hard to interpret.

The ability to find clusters in subspaces of the original space

Clusters frequently occupy only a subspace of the full data space; hence the popularity of dimensionality reduction techniques. Many algorithms have difficulty finding, for example, a 5-dimensional cluster in a 10-dimensional space.

The ability to handle distances in high dimensional spaces properly

High-dimensional spaces are quite different from low dimensional spaces. In [BGRS99], it is shown that the distances between the closest and farthest neighbors of a point may be very similar in high dimensional spaces. Perhaps an intuitive way to see this is to realize that the volume of a hyper-sphere with radius, r, and dimension, d, is proportional to r^d, and thus, for high dimensions, a small change in radius means a large change in volume. Distance based clustering approaches may not work well in such cases. If the distances between points in a high dimensional space are plotted, then the graph will often show two peaks: a "small" distance representing the distance between points in clusters, and a "larger" distance representing the average distance between points. If only one peak is present or if the two peaks are close, then clustering via distance based approaches will likely be difficult. Yet another set of problems has to do with how to weight the different dimensions. If different aspects of the data are being measured in different scales, then a number of difficult issues arise. Most distance functions will weight dimensions with greater ranges of data more highly. Also, clusters that are determined by using only certain dimensions may be quite different from the clusters determined by using different dimensions. Some techniques are based on using the dimensions that result in the greatest differentiation between data points. Many of these issues are related to the topic of feature selection, which is an important part of pattern recognition.
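The behavior described above can be observed with a small experiment; in the sketch below (sample size and dimensions are assumed values), the ratio between the farthest and nearest neighbor distances of a point shrinks toward 1 as the dimensionality grows.

    # Sketch: distance concentration in high dimensions. The nearest and farthest neighbors
    # of a point become almost equally far away, which hurts distance-based clustering.
    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        X = rng.random((1000, d))                        # uniform points in the unit hypercube
        dists = np.linalg.norm(X[1:] - X[0], axis=1)     # distances from the first point
        print(d, round(dists.max() / dists.min(), 2))    # ratio approaches 1 as d grows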

Ability to function in an incremental manner

In certain cases, e.g., data warehouses, the underlying data used for the original clustering can change over time. If the clustering algorithm can incrementally handle the addition of new data or the deletion of old data, then this is usually much more efficient than re-running the algorithm on the new data set.

APPLICATIONS OF CLUSTERING

Biology, computational biology and bioinformatics

Plant and animal ecology

Cluster analysis is used to describe and to make spatial and temporal comparisons of communities of organisms in heterogeneous environments; it is also used in plant systematics to generate artificial phylogenies or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes.

Transcriptomics

Clustering is used to build groups of genes with related expression patterns (also known as coexpressed genes). Often such groups contain functionally related proteins, such as enzymes for a specific pathway, or genes that are co-regulated. High throughput experiments using expressed sequence tags (ESTs) or DNA microarrays can be a powerful tool for genome annotation, a general aspect of genomics.

Sequence analysis

Clustering is used to group homologous sequences into gene families. This is a very important concept in bioinformatics, and evolutionary biology in general. See evolution by gene duplication.

High-throughput genotyping platforms

Clustering algorithms are used to automatically assign genotypes.

Human genetic clustering

The similarity of genetic data is used in clustering to infer population structures.

Medical imaging

On PET scans, cluster analysis can be used to differentiate between different types of tissue and blood in a three dimensional image. In this application, actual position does not matter, but the voxel intensity is considered as a vector, with a dimension for each image that was taken over time. This technique allows, for example, accurate measurement of the rate a radioactive tracer is delivered to the area of interest, without a separate sampling of arterial blood, an intrusive technique that is most common today.

IMRT segmentation

Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy.

Market research

Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers, and for use in market segmentation, product positioning, new product development and selecting test markets.

Grouping of shopping items

Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products.

Social network analysis

In the study of social networks, clustering may be used to recognize communities within large groups of people.

Search result grouping

In the process of intelligent grouping of the files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like Google. There are currently a number of web based clustering tools such as Clusty.

Slippy map optimization

Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map. This both makes the map faster and reduces the amount of visual clutter.

Software evolution

Clustering is useful in software evolution as it helps to reduce legacy properties in code by reforming functionality that has become dispersed. It is a form of restructuring and hence a form of direct preventative maintenance.

Image segmentation

Clustering can be used to divide a digital image into distinct regions for border detection or object recognition.

Evolutionary algorithms

Clustering may be used to identify different niches within the population of an evolutionary algorithm so that reproductive opportunity can be distributed more evenly amongst the evolving species or subspecies.

Recommender systems

Recommender systems are designed to recommend new items based on a user's tastes. They sometimes use clustering algorithms to predict a user's preferences based on the preferences of other users in the user's cluster.

Markov chain Monte Carlo methods

Clustering is often utilized to locate and characterize extrema in the target distribution.

Crime analysis

Cluster analysis can be used to identify areas where there are greater incidences of particular types of crime. By identifying these distinct areas or "hot spots" where a similar crime has happened over a period of time, it is possible to manage law enforcement resources more effectively.

Educational data mining

Cluster analysis is for example used to identify groups of schools or students with similar properties.

Field robotics

Clustering algorithms are used for robotic situational awareness to track objects and detect outliers in sensor data.

Mathematical chemistry

To find structural similarity, etc., for example, 3000 chemical compounds were clustered in the space of 90 topological indices.

Climatology

To find weather regimes or preferred sea level pressure atmospheric patterns.


