Basic Steps of FCM-Mode Clustering Algorithm

Published Date: 02 Nov 2017

ABSTRACT

The main objective is to design and develop a clustering algorithm for finding similar subsets in crime data. In this paper, we propose an algorithm that modifies the existing technique in three ways: i) a new attribute weightage scheme instead of the information gain ratio (IGR), ii) suitability for mixed data, and iii) FCM-based clustering instead of k-means. Generally, the effectiveness of a clustering algorithm depends on distance matching, which finds the similarity between data records and centroids. Giving equal importance to all attributes is not very effective in the clustering process. Instead, attribute weightage can be included in distance matching. A weight vector is generated based on mutual information. In the mutual information formula, the numerator considers the probability of attribute values within the non-overlapping interval together in all the classes, and the denominator considers the probability of attribute values within the overlapping interval individually in all the classes. The method for attribute weightage is common to both numerical and categorical data. Finally, similar subsets are grouped by an FCM-based clustering procedure in which distance matching is carried out using the attribute weights. The experimentation is done using publicly available crime datasets, and the performance of the proposed clustering algorithm is compared with the existing algorithm on the basis of clustering accuracy.

Keywords: FCM clustering, crime data, overlapping interval, non-overlapping interval, numerical data, categorical data.

1. INTRODUCTION

In recent years, the volume of crime has become a serious problem in many countries. Today's criminals make full use of modern technologies and high-tech methods, which help them commit crimes on an immense scale [1]. Law enforcers have to meet the challenges of crime control effectively and maintain public law and order. In recent years, data clustering techniques have faced several new challenges, including simultaneous feature subset selection, large-scale data clustering and semi-supervised clustering. Cluster analysis is an important human activity in which people indulge from childhood, when they learn to distinguish between animals and plants, for example, by continuously improving subconscious clustering schemes. It is widely used in numerous applications, including pattern recognition, data analysis, image processing and market research [2]. Recent research on these techniques bridges the gap between clustering theory and the practice of using clustering methods in crime applications [3]. Cluster accuracy can be improved by capturing the local correlation structure, associating each cluster with an independent weighting vector over the dimensions and the subspace spanned by them [4,5]. One of the major applications of clustering is crime analysis, which has become one of the most essential activities in the world, since technological development and the rapid growth of communities have resulted in a high magnitude of crime, often with bizarre patterns [6,7]. Crime analysis requires collecting crime data, categorizing it by crime type, analyzing and identifying important crime hot spots, and predicting and preventing future crimes.

A number of analysis routines exist for identifying and assessing potential hot spots in crime analysis: the mode, the fuzzy mode, hierarchical nearest neighbor clustering, risk-adjusted nearest neighbor hierarchical clustering, the Spatial and Temporal Analysis of Crime routine, K-means clustering, and the local Moran statistic [8,9]. Among the recent algorithms for this application is AK-Modes, which improves efficiency compared with the established algorithm and can help in the decision-making process for categorical data. The main disadvantage of the k-means algorithm is that the number of clusters, K, must be supplied as a parameter [18].

In this paper, we present a new attribute-weighted dissimilarity approach, used to design and develop a clustering algorithm for finding similar subsets in crime data. The weightage covers both numerical and categorical data. Distance values are measured separately for numerical and categorical attributes using their weightage values, and are combined to obtain an accurate distance matching for the dataset. In the attribute weightage, an overlapping and non-overlapping interval method is adopted to find the weightage function. Then, the FCM method is used to cluster the data records. The experimentation is done using publicly available crime datasets, and the performance of the proposed clustering algorithm is compared with the existing algorithm on the basis of clustering accuracy.

The paper is organized as follows. Section 2 presents a review of related works. The problem definition and the contributions of the paper are presented in Section 3. The proposed attribute weightage method for finding similar subsets in crime data is described in Section 4. Experimental results are discussed in Section 5, and Section 6 concludes the paper.

2. REVIEW OF RELATED WORKS

The literature review presents some clustering techniques and their applications, especially crime applications. Self-Organizing Map (SOM) and K-means methods have been used to evaluate cluster patterns. The lack of pattern quality assessment in spatial clustering could lead to meaningless or unknown information, so the validity of the SOM and K-means methods was examined using compactness and separation criteria. In this crime application, the data was separated into two parts. The first part contained simulated data with 2D (x, y) coordinates, and the second part was real data from crime investigation. The result could be used to classify the study area on the basis of property crimes [19].

Sungsoon Hwang and Jean-Claude Thill proposed a fuzzy clustering method for delineating urban housing submarkets and compared it with clustering methods based on classic (crisp) set theory. A fuzzy c-means algorithm was applied to obtain the fuzzy set membership degree of census tracts in housing submarkets defined within a metropolitan area. Issues of choosing algorithm parameters were discussed on the basis of applying fuzzy clustering to 85 metropolitan areas in the U.S. The results of fuzzy clustering were compared with those of the crisp-set counterpart, and the comparison showed that fuzzy clustering yields statistically more desirable clusters [15].

M.A.P. Chamikara et al. proposed an automated system, SL-SecureNet, developed on a data mining platform. The system makes crime locating and recording a simple and quick task. SL-SecureNet would therefore avoid most of the problems that appear in the current manual crime recording and analysis system and would improve efficiency in making timely decisions on security arrangements. It assists the police department in providing a high-quality security service [16]. Kilian Stoffel, Paul et al. described a methodology and an automatic procedure, based on fuzzy set theory, designed to infer precise and intuitive expert-system-like rules from original forensic data. The main steps of the methodology were detailed, as well as the experiments conducted on forensic datasets, both simulated and real, representing robberies and residential burglaries. The methodology consists of three main phases (fuzzy clustering of the data, membership function extraction and fuzzy rule generation) and can be easily implemented in most data analysis environments. The accuracy of the inferred rules was clearly higher than the minimum level required, making them usable in a practical setting [14].

Paul A. Norris [13] considered country-level victimization rates from ICVS-5, using clustering techniques to identify groups of nations that exhibit similar levels and patterns of victimization. It was argued that the clusters of nations present in the ICVS data reflect those also found in other areas of social policy. Brief consideration was also given to how the application of typologies from social policy can suggest questions for, and provide insights into, the study of comparative criminology.

3. PROBLEM IDENTIFICATION AND CONTRIBUTIONS

The k-means algorithm is the most heavily used clustering algorithm because it serves as an initial process for many other algorithms; it is based on a simple iterative scheme for finding a locally minimal solution. The main disadvantage of the k-means algorithm is the problem of choosing a similarity measure. Traditional algorithms are designed for numerical data and must be modified to take the specific characteristics of categorical data into consideration. Recently, AK-mode was proposed based on an attribute-weighted scheme and the k-means clustering procedure. In that work, the information gain ratio was used to compute the attribute weightage. If the mutual information among the various classes is considered in the attribute weightage computation, the performance could be significantly improved. Accordingly, we propose a new attribute weightage formula based on the idea of mutual information. Along with this, the dissimilarity measure is designed for both numerical and categorical datasets. Additionally, similar subsets are grouped using the FCM clustering procedure, which is adapted to handle both numerical and categorical data.

In this paper, we present three contributions:

i) A new attribute-weighted dissimilarity measure is proposed and applied to the FCM clustering algorithm for mixed crime data. The updating formulae of the FCM clustering algorithm with the new weighted dissimilarity measure are derived.

ii) The numerical and categorical dissimilarity measures are integrated with the attribute weights to form a mixed weighted dissimilarity measure. Based on this, a mixed attribute weighting algorithm is proposed to cluster high-dimensional data containing both numerical and categorical attributes.

iii) The performance of the mixed attribute weightage algorithm is investigated on datasets with both numerical and categorical data.

4. PROPOSED ATTRIBUTE WEIGHTED FUZZY CLUSTERING ALGORITHM FOR MIXED CRIME DATA

Generally, the effectiveness of a clustering algorithm depends on distance matching, which finds the similarity between data records and centroids. Giving equal importance to all attributes is not very effective in the clustering process, so attribute weightage should be included in distance matching. Here, we present a mutual-information-dependent formula that is used to generate a weight vector for all attributes of the input mixed data. Finally, similar subsets are grouped by an FCM-based clustering procedure in which distance matching is carried out using the attribute weights. In this paper, similar subsets are found in the crime dataset by the following three steps:

Step 1: New attribute weightage scheme

Step 2: Proposed distance measure

Step 3: Adapting FCM clustering for mixed data

4.1 New attribute weightage scheme

The new attribute weightage scheme uses two weightage methods, the weightage for numerical data values and the weightage for categorical data values, both computed with the help of overlapping and non-overlapping data values. Finally, both weightage values are placed in the distance measure formula and the distance is calculated.

Weightage for numerical data values: The weight of a numerical attribute indicates the importance of the attribute in different crime cases. The larger the weight, the more important the attribute is in that case category.

Weightage = (4.1.1)

= weightage for Numerical data

In this crime dataset, the weightage for numerical data values is computed with the help of overlapping and non-overlapping intervals. Overlapping and non-overlapping are defined differently for numerical and categorical attributes. For numerical attributes, each class is described by the interval from its minimum to its maximum data value. For categorical attributes, non-overlapping is the maximum number of attributes and overlapping is the minimum number of attributes in the class.
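One plausible reading of the interval idea for a numerical attribute can be sketched in Python. The sketch assumes that the overlapping interval is the intersection of the per-class [min, max] ranges and that every value outside it counts as non-overlapping. The function names and the toy data are hypothetical, and the sketch does not reproduce the weightage formula itself.

```python
from collections import defaultdict


def class_intervals(values, labels):
    """Return {class_label: (min_value, max_value)} for one numerical attribute."""
    grouped = defaultdict(list)
    for value, label in zip(values, labels):
        grouped[label].append(value)
    return {label: (min(vs), max(vs)) for label, vs in grouped.items()}


def overlap_region(intervals):
    """Intersection of all class intervals, or None if they do not all overlap.

    Assumption: "overlapping interval" means the range shared by every class.
    """
    low = max(lo for lo, _ in intervals.values())
    high = min(hi for _, hi in intervals.values())
    return (low, high) if low <= high else None


def split_counts(values, labels):
    """Count, per class, the values inside and outside the shared region."""
    region = overlap_region(class_intervals(values, labels))
    counts = defaultdict(lambda: {"overlapping": 0, "non_overlapping": 0})
    for value, label in zip(values, labels):
        inside = region is not None and region[0] <= value <= region[1]
        counts[label]["overlapping" if inside else "non_overlapping"] += 1
    return region, dict(counts)


# Hypothetical toy attribute (for example, offender age) with two crime classes.
ages = [18, 22, 25, 30, 28, 35, 40, 45]
classes = ["A", "A", "A", "A", "B", "B", "B", "B"]
region, counts = split_counts(ages, classes)
print(region)  # (28, 30)
print(counts)
```

Counts of this kind could then feed the per-class and all-class probabilities described in the text.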

Non overlapping interval

= (4.1.2)

For the non-overlapping value of numerical data, the numerator considers the probability of attribute values within the non-overlapping intervals together in all the classes, and the denominator considers the probability of attribute values within the non-overlapping interval individually in all the classes. Equation (4.1.2) measures only the non-overlapping value for numerical data.

(4.1.3)

The probability of the non-overlapping value over all the classes is defined as the ratio of the sum of the non-overlapping values found in each class to the total number of attributes. Equation (4.1.3) is used to calculate the total number of non-overlapping values for the crime datasets.

(4.1.4)

The probability for an individual class is defined as the ratio of the non-overlapping attributes in that class to the total number of attributes in that class. Equation (4.1.4) is used to find the number of non-overlapping values in the individual classes.

= (4.1.5)

For the overlapping value of numerical data, the numerator considers the probability of attribute values within the overlapping interval together in all the classes, and the denominator considers the probability of attribute values within the overlapping interval individually in all the classes. Equation (4.1.5) measures the overlapping value for numerical data.

(4.1.6)

The probability of the overlapping value is computed over all the classes. It is defined as the ratio of the sum of the overlapping values found in each class to the total number of attributes. Equation (4.1.6) is used to calculate the total number of overlapping values for the crime dataset.

(4.1.7)

The probability for an individual class is defined as the ratio of the overlapping attributes in that class to the total number of attributes in that class. Equation (4.1.7) is used to find the number of overlapping values in the individual classes.

Weightage for categorical data values: The weight of a categorical attribute indicates the importance of the attribute in different crime cases. The larger the weight, the more important the attribute is in that case category.

Weightage = (4.2.1)

= weightage for Categorical data

Here, the two terms are the non-overlapping and overlapping values for categorical attributes. A categorical attribute is one whose values do not have a natural ordering. Typical categorical attributes are behavioral attributes, which usually describe an offender's trait, location, offence category, sub-category and so on.

= (4.2.2)

For the non-overlapping value of categorical data, the numerator considers the probability of the maximum attribute values within the non-overlapping interval together in all the classes, and the denominator considers the probability of the maximum attribute values within the overlapping interval individually in all the classes. Equation (4.2.2) measures only the non-overlapping value for categorical data.

(4.2.3)

The probability of finding the maximum attributes in all the classes is defined as the ratio of the sum of the maximum attributes found in each class to the total number of attributes. Equation (4.2.3) is used to calculate the total number of maximum attributes in the crime dataset.

(4.2.4)

The probability for an individual class is measured from the maximum attributes and is defined as the ratio of the non-overlapping attributes in that class to the total number of attributes in that class. Equation (4.2.4) is used to find the number of non-overlapping values in the individual classes.

(4.2.5)

For the overlapping value of categorical data, the numerator considers the probability of the minimum attribute values within the overlapping interval together in all the classes, and the denominator considers the probability of the minimum attribute values within the overlapping interval individually in all the classes. Equation (4.2.5) measures only the overlapping value for categorical data.

(4.2.6)

The probability of finding the minimum attributes in all the classes is defined as the ratio of the sum of the minimum attributes found in each class to the total number of attributes. Equation (4.2.6) is used to calculate the total number of minimum attributes in the crime dataset.

(4.2.7)

The probability for an individual class is measured from the minimum attributes and is defined as the ratio of the overlapping attributes in that class to the total number of attributes in that class. Equation (4.2.7) is used to find the number of overlapping values in the individual classes.

4.2 Proposed distance measure

Distance matching is carried out using the attribute weights and two separate distance measures, one for numerical and one for categorical data. The equation below has two parts, containing the formulas for numerical and categorical data values.

(4.2.1)

Where, = distance measure for numerical data

= distance measure for categorical data

= weightage for numerical data

= weightage for categorical data

m1 = numerical object

m2 = categorical object

Definition 1: (Distance Measure for numerical data)

The distance between two data points with respect to their numerical attributes is the Euclidean distance. For two points $x$ and $y$, it is computed over the numerical attributes $k$ as follows:

$d(x, y) = \sqrt{\sum_{k} (x_k - y_k)^2}$

Definition 2: (Dissimilarity Measure for categorical data)

The dissimilarity of two data points with respect to their categorical attributes is computed based on the following equations.
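A minimal Python sketch of a weighted mixed distance may help to make the two parts concrete. It assumes that each squared numerical difference is scaled by its attribute weight inside the Euclidean term, that the categorical part is a weighted simple-matching count (an attribute contributes its weight on a mismatch), and that the two parts are added. All three choices and all names are assumptions for illustration, not the paper's exact formula.

```python
import math


def numerical_distance(x, c, weights):
    """Weighted Euclidean distance over the numerical attributes."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, c, weights)))


def categorical_dissimilarity(x, c, weights):
    """Weighted simple matching: add the attribute weight on each mismatch."""
    return sum(w for a, b, w in zip(x, c, weights) if a != b)


def mixed_distance(x_num, x_cat, c_num, c_cat, w_num, w_cat):
    """Sum of the numerical and categorical parts (assumed combination)."""
    return (numerical_distance(x_num, c_num, w_num)
            + categorical_dissimilarity(x_cat, c_cat, w_cat))


# Hypothetical record and centroid with two numerical and two categorical attributes.
d = mixed_distance(
    x_num=[3.0, 4.0], x_cat=["male", "Liverpool"],
    c_num=[0.0, 0.0], c_cat=["male", "inner Sydney"],
    w_num=[1.0, 1.0], w_cat=[0.5, 0.25],
)
print(d)  # 5.25
```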

4.3 FCM clustering algorithm for both numerical and categorical data

Consider a mixed dataset of objects to be clustered, where each object is defined by attributes of which m1 are categorical and m2 are numerical. The clustering problem can then be formulated mathematically as the minimization of the following objective function:

$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \| x_i - c_j \|^2$ (4.3.1)

where $m$ is any real number greater than 1, $u_{ij}$ is the degree of membership of $x_i$ in cluster $j$, $x_i$ is the $i$-th of the d-dimensional measured data, $c_j$ is the d-dimensional center of the cluster, and $\| \cdot \|$ is the similarity between any measured data and the center.

Fuzzy partitioning is carried out through an iterative optimization, with the membership updated as

$u_{ij} = \dfrac{1}{\sum_{l=1}^{C} \left( \dfrac{\| x_i - c_j \|}{\| x_i - c_l \|} \right)^{2/(m-1)}}$ (4.3.2)

The iteration stops when $\max_{ij} | u_{ij}^{(k+1)} - u_{ij}^{(k)} | < \varepsilon$, where $\varepsilon$ is a termination criterion between 0 and 1 and $k$ is the iteration step. The procedure converges to a local minimum.
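The standard FCM membership update and stopping test can be sketched in plain Python as follows. The sketch takes a precomputed matrix of object-to-centroid distances, so any distance (including a weighted mixed one) can be plugged in; the function names and the handling of zero distances are illustrative assumptions.

```python
def update_memberships(distances, m=2.0):
    """Standard FCM membership update.

    distances[i][j] is the distance from object i to centroid j;
    the result u[i][j] is the membership of object i in cluster j.
    """
    power = 2.0 / (m - 1.0)
    memberships = []
    for row in distances:
        zeros = [j for j, d in enumerate(row) if d == 0.0]
        if zeros:
            # The object coincides with one or more centroids:
            # share the full membership among them (assumed convention).
            memberships.append(
                [1.0 / len(zeros) if j in zeros else 0.0 for j in range(len(row))]
            )
            continue
        memberships.append(
            [1.0 / sum((d_j / d_l) ** power for d_l in row) for d_j in row]
        )
    return memberships


def max_change(u_new, u_old):
    """Largest absolute change in any membership value between two iterations."""
    return max(
        abs(a - b)
        for row_new, row_old in zip(u_new, u_old)
        for a, b in zip(row_new, row_old)
    )


# One object at distance 1 from the first centroid and 3 from the second.
u = update_memberships([[1.0, 3.0]])
print([round(v, 3) for v in u[0]])  # [0.9, 0.1]
```

With m = 2 the memberships are inversely proportional to the squared distances, which is why the closer centroid receives 0.9 here.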

Basic steps of FCM-mode clustering algorithm:

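1. Initialize the membership matrix $U = [u_{ij}]$, $U^{(0)}$.

2. Update the membership matrix from $U^{(k)}$ to $U^{(k+1)}$.

3. Compute the centroids for numerical data based on Definition 3 and for categorical data based on Definition 4.

4. If $\| U^{(k+1)} - U^{(k)} \| < \varepsilon$, then STOP; otherwise return to step 2.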

Definition 3: (Centroid updating for numerical data)

The center of the cluster for any numerical attribute is computed from the membership values and the numerical data values.

$c_j = \dfrac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}$ (4.3.3)

Here,

$c_j$ - d-dimensional center of cluster $j$.

$x_i$ - numerical data object.
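The numerical centroid update is the membership-weighted mean used in standard FCM. A short Python sketch, with hypothetical names and the fuzzifier m passed as a parameter:

```python
def update_numerical_centroids(data, memberships, m=2.0):
    """Membership-weighted mean of the numerical attributes for each cluster.

    data[i] is the list of numerical values of object i;
    memberships[i][j] is the membership of object i in cluster j.
    Assumes every cluster has non-zero total membership.
    """
    n_clusters = len(memberships[0])
    n_attrs = len(data[0])
    centroids = []
    for j in range(n_clusters):
        weights = [u[j] ** m for u in memberships]
        total = sum(weights)
        centroids.append(
            [sum(w * x[k] for w, x in zip(weights, data)) / total
             for k in range(n_attrs)]
        )
    return centroids


# Two objects with equal membership in both clusters: both centers are the mean.
print(update_numerical_centroids([[0.0], [10.0]], [[0.5, 0.5], [0.5, 0.5]]))
# [[5.0], [5.0]]
```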

Definition 4: (Updating of k-centroids for categorical data)

The centroids of categorical attributes within a cluster are determined by the relative frequency of their categories. For each categorical attribute, the category with the highest relative frequency is chosen as the new representative. For example, gender is a categorical attribute with two categories (male and female), and location is a categorical attribute with a number of categories (Australia, inner Sydney, Liverpool, Campbell town and more).
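A hedged Python sketch of the categorical centroid update follows. It assumes that the relative frequency of a category is weighted by the fuzzy membership raised to the fuzzifier m, in the spirit of fuzzy k-modes; with crisp 0/1 memberships this reduces to the plain most-frequent category. The function name and the toy records are hypothetical.

```python
from collections import defaultdict


def update_categorical_centroids(data, memberships, m=2.0):
    """Pick, per cluster and attribute, the category with the largest
    membership-weighted frequency (assumed weighting)."""
    n_clusters = len(memberships[0])
    n_attrs = len(data[0])
    centroids = []
    for j in range(n_clusters):
        mode = []
        for k in range(n_attrs):
            freq = defaultdict(float)
            for x, u in zip(data, memberships):
                freq[x[k]] += u[j] ** m
            mode.append(max(freq, key=freq.get))
        centroids.append(mode)
    return centroids


# Hypothetical records with gender and location attributes.
records = [
    ["male", "Liverpool"],
    ["male", "inner Sydney"],
    ["female", "inner Sydney"],
]
u = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
print(update_categorical_centroids(records, u))
# [['male', 'Liverpool'], ['female', 'inner Sydney']]
```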

AWFCM-Modes Algorithm

Input: Dataset D, attribute weights for numerical and categorical data

Output: clustering result

Step 1: Initialize the membership matrix $U = [u_{ij}]$, $U^{(0)}$.

Step 2: Update the membership matrix from $U^{(k)}$ to $U^{(k+1)}$ for numerical and categorical data.

Step 3: Compute the centroids for numerical data based on Definition 3.

Step 4: Compute the centroids for categorical data based on Definition 4.

Step 5: Go to step 2 until $\| U^{(k+1)} - U^{(k)} \| < \varepsilon$, then STOP.
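The loop can be put together as a compact, self-contained Python sketch. It is a sketch under stated assumptions rather than the authors' implementation: the mixed distance is a weighted Euclidean part plus a weighted simple-matching part, categorical centroids use membership-weighted frequencies, and centroids are computed from the current memberships before each membership update so that the first update has centers to work with. The name `awfcm_modes` and all parameters are hypothetical.

```python
import math
import random
from collections import defaultdict


def mixed_distance(x_num, x_cat, c_num, c_cat, w_num, w_cat):
    """Assumed distance: weighted Euclidean part plus weighted mismatch part."""
    num = math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x_num, c_num, w_num)))
    cat = sum(w for a, b, w in zip(x_cat, c_cat, w_cat) if a != b)
    return num + cat


def awfcm_modes(num_data, cat_data, w_num, w_cat, n_clusters,
                m=2.0, eps=1e-4, max_iter=100, u0=None, seed=0):
    """Return (memberships, numerical centroids, categorical centroids)."""
    rng = random.Random(seed)
    if u0 is not None:
        u = [list(row) for row in u0]
    else:
        # Step 1: random U(0), each row normalised to sum to 1.
        u = []
        for _ in range(len(num_data)):
            row = [rng.random() + 1e-9 for _ in range(n_clusters)]
            total = sum(row)
            u.append([v / total for v in row])
    c_num, c_cat = [], []
    for _ in range(max_iter):
        # Steps 3 and 4: centroids from the current memberships.
        c_num, c_cat = [], []
        for j in range(n_clusters):
            wts = [row[j] ** m for row in u]
            total = sum(wts)
            c_num.append([sum(w * x[k] for w, x in zip(wts, num_data)) / total
                          for k in range(len(num_data[0]))])
            mode = []
            for k in range(len(cat_data[0])):
                freq = defaultdict(float)
                for x, w in zip(cat_data, wts):
                    freq[x[k]] += w
                mode.append(max(freq, key=freq.get))
            c_cat.append(mode)
        # Step 2: membership update from the mixed distances.
        power = 2.0 / (m - 1.0)
        new_u = []
        for xn, xc in zip(num_data, cat_data):
            d = [mixed_distance(xn, xc, c_num[j], c_cat[j], w_num, w_cat)
                 for j in range(n_clusters)]
            zeros = [j for j, v in enumerate(d) if v == 0.0]
            if zeros:
                new_u.append([1.0 / len(zeros) if j in zeros else 0.0
                              for j in range(n_clusters)])
            else:
                new_u.append([1.0 / sum((dj / dl) ** power for dl in d) for dj in d])
        # Step 5: stop when no membership changes by more than eps.
        change = max(abs(a - b) for r1, r2 in zip(new_u, u) for a, b in zip(r1, r2))
        u = new_u
        if change < eps:
            break
    return u, c_num, c_cat


# Hypothetical toy data: one numerical and one categorical attribute.
num = [[1.0], [1.2], [0.9], [8.0], [8.3], [7.9]]
cat = [["a"], ["a"], ["a"], ["b"], ["b"], ["b"]]
memberships, centers_num, centers_cat = awfcm_modes(num, cat, [1.0], [1.0], 2)
print(centers_cat)
```

Passing an explicit `u0` makes a run reproducible without relying on the random seed, which is convenient when comparing settings such as the number of iterations.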

5. RESULT AND DISCUSSION

5.1 Experimental setup

This section presents the experimental results of the proposed attribute-weighted fuzzy clustering algorithm for mixed crime data. The proposed approach was implemented in MATLAB 7.12, and the experiments were performed on a 3.0 GHz Pentium PC with 2 GB of main memory. The experiments evaluate the new attribute-weighted dissimilarity approach, which is designed to find similar subsets in crime data.

5.2 Dataset description

The proposed system is evaluated on two widely used datasets, namely crime and hepatitis. The crime data is taken from a publicly available source, and the hepatitis data is taken from the UCI machine learning repository [17].

5.3 Evaluation metrics

Clustering accuracy is used to evaluate the performance of the proposed approach on the crime dataset. The evaluation metric is given below:

Clustering Accuracy, $CA = \dfrac{1}{N} \sum_{i=1}^{T} X_i$ (5.3.1)

where $N$ is the number of data points in the dataset,

$T$ is the number of resultant clusters, and

$X_i$ is the number of data points occurring in both cluster $i$ and its corresponding class.
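Clustering accuracy of this kind, the fraction of data points that fall in the class corresponding to their cluster, can be sketched in a few lines of Python. The sketch assumes that the class corresponding to a cluster is its majority class; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_accuracy(cluster_ids, class_labels):
    """Fraction of points whose class is the majority class of their cluster."""
    by_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cid, Counter())[label] += 1
    # For each cluster, count the points that belong to its majority class.
    matched = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return matched / len(cluster_ids)


# Six points in two clusters; one point in cluster 0 has the minority class.
acc = clustering_accuracy([0, 0, 0, 1, 1, 1], ["x", "x", "y", "y", "y", "y"])
print(round(acc, 3))  # 0.833
```

For fuzzy results, each object would first be assigned to the cluster with its highest membership.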

5.4 Performance evaluation

The experimental results of the proposed approach are shown in Figures 1 to 4. We calculate the clustering accuracy of the resultant clusters by changing the k-value (order of initial clustering), and the clustering error is calculated at the same time. The results are plotted as the graphs in Figures 1 to 4. In this work, we adopted AWFCM-mode as the clustering approach, and this section compares its performance with that of AK-mode. The experimental results are discussed below.

Figures 1 and 2 compare the performance on the two datasets for 50 iterations. In Figure 1, AK-mode reaches 53% accuracy, whereas AWFCM reaches 58%. In Figure 2, AK-mode reaches 70% accuracy and AWFCM-mode 100%. Figures 3 and 4 show the same comparison for 25 iterations. In Figure 3, AK-mode reaches 53% accuracy, whereas AWFCM reaches 58%. In Figure 4, AK-mode reaches 70% accuracy and AWFCM-mode 80%.

Figure 1: Clustering accuracy of dataset 1 (AK-Mode) and dataset 2 (AWFCM) for 50 iterations (crime dataset)

Figure 2: Clustering accuracy of dataset 1 (AK-Mode) and dataset 2 (AWFCM) for 50 iterations (hepatitis dataset)

Figure 3: Clustering accuracy of dataset 1 (AK-Mode) and dataset 2 (AWFCM) for 25 iterations (crime dataset)

Figure 4: Clustering accuracy of dataset 1 (AK-Mode) and dataset 2 (AWFCM) for 25 iterations (hepatitis dataset)

6. CONCLUSION

In this paper, we suggested a method for the attribute weightage of numerical and categorical data using overlapping and non-overlapping intervals. An attribute-weightage-based clustering algorithm was then developed for finding similar subsets in crime data. The proposed technique took AK-mode as its motivation and modified the existing technique in three ways: i) a new attribute weightage scheme instead of IGR, ii) suitability for mixed data, and iii) FCM-based clustering instead of k-means. An FCM-based clustering approach is proposed in this paper to find similar subsets in crime data. The performance of the proposed clustering algorithm is compared with the AK-mode algorithm on the basis of clustering accuracy. From the experimental results, it is observed that the proposed FCM method achieves better accuracy (88%) than AK-mode. It is therefore concluded that the proposed clustering approach outperforms AK-mode.
