Pattern Recognition And Classification


Submitted to the

Institute of Graduate Studies and Research

in partial fulfillment of the requirements for the Degree of

Master of Science

in

Computer Engineering

Eastern Mediterranean University

January 2013

Gazimağusa, North Cyprus

Approval of the Institute of Graduate Studies and Research

Prof. Dr. Elvan Yılmaz, Director

I certify that this thesis satisfies the requirements as a thesis for the degree of Master of Science in Computer Engineering.

Assoc. Prof. Dr. Muhammed Salamah

Chair, Department of Computer Engineering

We certify that we have read this thesis and that in our opinion it is fully adequate in scope and quality as a thesis for the degree of Master of Science in Computer Engineering.

Prof. Dr. Hakan Altincay

Supervisor

Examining Committee

1. Prof. Dr. Hakan Altincay

2. Prof. Dr. Hasan Kömürcügil

3. Assoc. Prof. Dr. Ekrem Varoğlu

ABSTRACT

ÖZET


DEDICATION

ACKNOWLEDGMENT

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

Chapter 1

1 INTRODUCTION

Pattern classification

Pattern recognition and classification is the science of grouping data by labeling input samples as one of a set of known classes. Examples of such data are speech signals, face images, eye images, handwritten words, and e-mail messages. Most pattern recognition algorithms try to match input data to a class while accounting for its statistical variation.

Classification algorithms label each data sample with one of a given set of classes. A class is a group of things that can be placed in a category based on their common properties. For example, in a face recognition problem, each person is a class; if we need to design an automated fish-packing system, each type of fish is a different class. The most common way to represent objects (e.g., faces, fish) is by features. For simple real-world objects, the features can be their sizes, shapes, colors, and so on. In other words, a feature is any distinctive aspect, quality, or characteristic.

A feature vector represents an object in a statistical setting as a combination of d features arranged in a d-dimensional column vector, where d is the number of features measured per object.

The piece of input data expressing an object instance is formally termed a sample, and a collection of samples is termed a dataset. For example, in face recognition, each face image stored in the dataset is a sample.

A pattern recognition system typically operates in two phases: a training phase and a test phase. In the training phase, a training dataset is used to compute a decision boundary or decision regions in the feature space, so that training examples are classified as correctly as possible. For this purpose, the training dataset contains both the data and the correct labels. The algorithm analyzes the training dataset and produces a model to be used for classification; this analysis is the training phase. The accuracy of the model is then measured by executing its decision rule on unseen samples, which is the test phase.

Classification problems, a more general form of pattern recognition, are addressed by learning algorithms in which each training sample is a pair consisting of a feature vector and its class label [1].

Classification methods are roughly categorized into two groups: parametric and non-parametric. Parametric methods fit a parametric model to the training data and interpolate to classify test data. In other words, they assume a specific functional form for the probability density function of each class and optimize the function parameters to fit the data, typically by maximum likelihood estimation. Examples are the Linear Discriminant Classifier (LDC) and the Quadratic Discriminant Classifier (QDC). Non-parametric methods make no assumptions about the probability density function of each class; since an assumed form may not be correct, a classifier that avoids the assumption may perform better. Non-parametric methods therefore determine the form of the density from the data, and the size of their parameter set may grow with the amount of data. Examples are k-NN, neural networks, and SVMs [2]. Compared to parametric methods, non-parametric methods are lazy learners: they memorize the training dataset and do almost all of the work at test time. Parametric methods estimate a fixed set of parameters, while non-parametric methods directly evaluate how dense the vicinity of a sample is.

Nearest Neighbor (NN) classification is a simple yet effective scheme that labels a sample according to its closest sample in the training dataset [3]. A modified version, k-NN, makes the decision based on the k nearest neighbors. The training phase is light: it simply stores all training samples and their labels. To classify a sample, the k closest training samples are found and the most common label among them is chosen. When the feature vector consists of real numbers, the most common distance function is the Euclidean distance [4].
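As an illustration, the k-NN rule described above can be sketched in a few lines of Python; the function name knn_classify and the toy data are ours, not from the thesis:

```python
from collections import Counter
import math

def knn_classify(query, samples, labels, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    # Sort all (distance, label) pairs by Euclidean distance to the query.
    dists = sorted((math.dist(query, s), lbl)
                   for s, lbl in zip(samples, labels))
    # Majority vote among the k closest samples.
    votes = Counter(lbl for _, lbl in dists[:k])
    return votes.most_common(1)[0][0]

# Two well-separated 2-D classes (toy data).
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["a", "a", "a", "b", "b", "b"]
print(knn_classify([0.5, 0.5], X, y))  # -> a
print(knn_classify([5.5, 5.5], X, y))  # -> b
```

Note that, as the text says, all of the work happens at query time; "training" is just storing X and y.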

Although NN is robust to noisy training data and is easy to implement and debug, it has several disadvantages: high computational cost, the difficulty of choosing the best distance measure and attributes, and the need to determine the value of the parameter k [5]. The main disadvantage is the need for a large number of samples to obtain reasonable performance. As a geometrical-neighborhood approach, its performance improves as the number of samples increases [2], while in practice the number of samples is restricted. For example, in face or fingerprint recognition problems, the number of samples is very limited and the error rate is considerable. The NFL method was proposed to address these problems.

NFL classification

The Nearest Feature Line (NFL) method uses a line, called a feature line (FL), through each pair of samples of the same class [3]. The main goal is to generalize the representational capacity of a class by considering FLs. NFL has been shown to consistently achieve a lower error rate than NN on both real and artificial data [6]. It is mainly used in face recognition, where the number of samples is very limited; it effectively improves classification performance, especially when the number of samples per class is small [3]. NFL adds extra information by creating feature lines passing through each pair of samples of the same class. Classification is done by finding the feature line nearest to the query point; the class owning that feature line is selected as the decision.

A feature line consists of three segments: the inner one, created by interpolation between the two points; a forward extrapolation segment; and a backward extrapolation segment. If an interpolation segment passes through the cluster of another class, accuracy is reduced; this is termed interpolation inaccuracy, and some samples of that class may be misclassified by the segment. In the other scenario, the extrapolation part of a feature line passes through other class clusters and causes extrapolation inaccuracy. In both problems, there is a close FL whose defining sample points are far from the query point or from each other [7]. Accordingly, besides its computational complexity, the method suffers from interpolation and extrapolation inaccuracies. Following this technique, several variants were developed to reduce the error and/or the cost. The center-based nearest neighbor (CNN) classifier [8] was proposed to reduce the computational cost of NFL by using center-based lines (CLs), each passing through one training sample and the center of that sample's class, instead of lines through pairs of training samples [8, 9]. In CNN, the classification decision is made by finding the CL nearest to the query point. Comparisons with the NN and NFL classifiers show that CNN achieves enhanced performance [8]. Another approach for reducing the computational cost is the nearest neighbor line (NNL) [10], which uses, for each class, only the line through that class's points closest to the query; instead of computing distances to all FLs, just one line per class is considered. Experiments on face recognition problems show that NNL has much lower computation time and achieves performance competitive with NFL [11].

All of the methods mentioned above still suffer from the FL interpolation and extrapolation problems, and more advanced methods were proposed to suppress these drawbacks of NFL. The rectified nearest feature line segment (RNFLS) method [12] builds a subspace from FL segments by removing all feature lines trespassing on the territory of other classes (to avoid interpolation inaccuracy) and classifies a query point by finding its nearest FL segment (to avoid extrapolation inaccuracy). To avoid extrapolation inaccuracy, RNFLS uses segmentation and considers only the interpolation segment: the query point q has a projection point p on the FL, and if p falls on an extrapolation segment, it is replaced by the nearest endpoint of the FL. This process cuts off the feature line so that only the interpolation part is preserved. RNFLS defines two types of territory. The sample territory of a sample point x is the hypersphere centered at x whose radius equals the distance to x's nearest neighbor from any other class. A class territory is the union of the sample territories of all sample points belonging to the corresponding class.

The shortest feature line segment (SFLS) method [9] avoids extrapolation inaccuracy, and in some cases interpolation inaccuracy, by seeking the shortest FL segment that satisfies a geometric relation with the query point. The decision is made by finding the smallest hypersphere that contains the query point, where each hypersphere is constructed with an FL segment as its diameter. SFLS does not compute point-to-line distances.

As described above, some feature lines cause interpolation and extrapolation inaccuracy problems. As an alternative approach to improving the performance of the NFL method, an editing procedure can be applied to remove the feature lines leading to misclassification.

Objectives

The major aim of this study is to propose an editing-based selection of feature lines to reduce interpolation and extrapolation inaccuracy in NFL. The proposed method is based on an algorithm that iteratively evaluates the increase or decrease in inaccuracy caused by deleting segments, in several steps: ranking, intersection detection, and pruning.

In the first step, a ranking scheme is applied. For each segment, we calculate and record the number of correct and incorrect classifications that it makes (its positive and negative ranks). The segments are then sorted by negative rank, and the most negative one is considered first. We check whether removing the corresponding segment leads to better classification by comparing the ranks before and after its removal. If removal improves classification, the segment is marked as deleted. This step is repeated until no further segment can be found for deletion.

In the second step, intersections between segments are investigated. If two segments from different classes intersect, the longer segment, i.e., the one whose constituent points are farther from each other, is removed. In a higher-dimensional feature space, two segments that are close enough are considered to intersect.
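For the two-dimensional case, the intersection test in this step could be implemented with standard orientation predicates. The sketch below is illustrative only: it tests proper intersection, ignoring collinear edge cases and the higher-dimensional closeness criterion mentioned above.

```python
def _orient(a, b, c):
    """Sign of the 2-D cross product (b - a) x (c - a)."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)

def segments_intersect(p1, p2, p3, p4):
    """True if 2-D segments p1-p2 and p3-p4 properly cross each other:
    the endpoints of each segment lie on opposite sides of the other."""
    return (_orient(p1, p2, p3) != _orient(p1, p2, p4) and
            _orient(p3, p4, p1) != _orient(p3, p4, p2))

print(segments_intersect((0, 0), (2, 2), (0, 2), (2, 0)))  # -> True
print(segments_intersect((0, 0), (1, 0), (0, 1), (1, 1)))  # -> False
```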

In the last step, pruning is applied. The aim of this step is to delete sample points that are very close to sample points of other classes. For each point, find the nearest neighbor from the same class and the nearest neighbor from any other class, and compute both distances. If the point from the other class is closer, the point is a candidate for deletion. A segment is marked as deleted if the distance between its endpoints is greater than this nearest distance. After the new subspace is prepared, classification proceeds in the same way as in NFL.
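The candidate test in this pruning step, flag a point whenever a sample of another class is closer to it than any sample of its own class, might be sketched as follows. This is an illustrative reading of the step, not the thesis's exact implementation, and the data are ours:

```python
import math

def prune_candidates(samples, labels):
    """Flag samples whose nearest neighbour belongs to a different class."""
    flagged = []
    for i, (s, lbl) in enumerate(zip(samples, labels)):
        # Distance to the nearest same-class sample (inf if the class
        # has no other member) and to the nearest other-class sample.
        d_same = min((math.dist(s, t)
                      for j, (t, l) in enumerate(zip(samples, labels))
                      if j != i and l == lbl), default=math.inf)
        d_other = min(math.dist(s, t)
                      for t, l in zip(samples, labels) if l != lbl)
        if d_other < d_same:
            flagged.append(i)
    return flagged

# Two tight clusters plus one "b" sample stranded near class "a" (toy data).
X = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [0.5, 0.5]]
y = ["a", "a", "a", "b", "b", "b"]
print(prune_candidates(X, y))  # -> [5]  (only the stranded sample)
```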

We have tested the model on 15 datasets obtained from the … library, and improvements are achieved. Compared to NFL, SFLS provides better accuracy on 11 datasets and RNFLS on 12, while our proposed method performs better on 14 datasets. By removing segments from the feature space, our method also has lower computational complexity than NFL, SFLS, and RNFLS.

Layout of the thesis

The rest of the thesis is organized as follows. Chapter 2 gives a brief literature review, including recent modifications of nearest neighbor classification. Chapter 3 describes the proposed method. Chapter 4 reports the experimental results, and Chapter 5 presents the conclusion and future work.

Chapter 2

2 LITERATURE REVIEW

NFL

The main idea of the nearest feature line method, originally proposed for face recognition [3], is to generalize the representational capacity of the data samples using lines, called feature lines, obtained by interpolating and extrapolating each pair of same-class samples. NFL therefore adds extra information to the sample set. The new subspace containing the feature lines addresses the main disadvantage of the NN method: NN performance increases with the number of samples, but when the sample set is not big enough the performance is very low. In practice, where the number of samples is limited, generalizing the data in this way is very effective.

Classification in NFL is done by computing the distance from the query point to each FL and selecting the nearest one.

Figure: Classification using the NFL method in a subspace generalized by FLs passing through each pair of samples of the same class.

A feature line is the straight line passing through two samples x and y of the same class, denoted xy and called an FL of that class. The query point q is projected onto the feature line as point p (Figure: Classification using the NFL method). The projection point is calculated by

p = x + μ(y − x),

where the position parameter μ is defined as

μ = ((q − x) · (y − x)) / ((y − x) · (y − x)),

where '·' is the dot product. The position parameter describes the position of p relative to x and y: μ = 0 means p = x, and μ = 1 means p = y; p is an interpolation point between x and y if 0 < μ < 1; p is a forward extrapolation point on the y side if μ > 1; and p is a backward extrapolation point on the x side if μ < 0.

Figure : The position parameter values

The distance from the query point q to an FL xy is defined as

d(q, xy) = ‖q − p‖,

where ‖·‖ is the Euclidean distance, defined for vectors u and v as

‖u − v‖ = ( Σᵢ (uᵢ − vᵢ)² )^(1/2).

Having computed the distances to the FLs of all classes, the class that has the closest FL is selected as the most likely class.

If N_w denotes the number of samples belonging to class w, and there are C classes in total, the total number of FLs is

N_FL = Σ_{w=1}^{C} N_w (N_w − 1) / 2.
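Putting the formulas above together, a minimal NFL classifier might look as follows. This is a sketch following [3]; the function and data names are ours:

```python
import math

def nfl_classify(q, samples, labels):
    """NFL rule: project q onto the line through every same-class pair
    (x, y) and return the class owning the closest feature line."""
    by_class = {}
    for s, lbl in zip(samples, labels):
        by_class.setdefault(lbl, []).append(s)
    best_dist, best_label = math.inf, None
    for lbl, pts in by_class.items():
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                x, y = pts[i], pts[j]
                xy = [b - a for a, b in zip(x, y)]
                # mu = ((q - x).(y - x)) / ((y - x).(y - x))
                mu = (sum((b - a) * d for a, b, d in zip(x, q, xy))
                      / sum(d * d for d in xy))
                p = [a + mu * d for a, d in zip(x, xy)]  # projection of q
                dist = math.dist(q, p)
                if dist < best_dist:
                    best_dist, best_label = dist, lbl
    return best_label

# Class "a" lies along the line y = 0, class "b" along y = 2 (toy data).
X = [[0, 0], [1, 0], [0, 2], [1, 2]]
y = ["a", "a", "b", "b"]
print(nfl_classify([5, 0.4], X, y))  # -> a
```

The inner double loop visits each pair once, so the work per class matches the N_w(N_w − 1)/2 count given above.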

NFL drawbacks

Although NFL is successful in improving classification ability, it has some drawbacks. First, with a large number of samples it faces a large computational complexity. Second, it suffers from what is denoted extrapolation and interpolation inaccuracy [10].

Extrapolation inaccuracy may happen in a low-dimensional feature space; it is harmless if the dimension of the feature space is large enough [12]. In Figure: Extrapolation inaccuracy in NFL, the query point q, which belongs to class "Star", is classified to class "Circle". This classification error is caused by the extrapolation part of an FL belonging to the "Circle" class.

Figure : Extrapolation inaccuracy in NFL

Interpolation inaccuracy happens when an FL crosses a cluster of another class; in other words, it may happen when the two defining points are very far from each other. Interpolation inaccuracy severely harms the classification decision. In Figure: Interpolation inaccuracy in NFL, the query point q is misclassified to the "Star" class while it belongs to the "Circle" class.

Figure : Interpolation inaccuracy in NFL

To overcome both of the above-mentioned drawbacks, several studies have improved the performance of NFL. There is also an extended rule called k-NFL, in which voting is applied over the class labels of the k nearest FLs.

Rectified nearest feature line segment (RNFLS)

As discussed in the previous section, NFL suffers from extrapolation and interpolation inaccuracy. RNFLS is one method proposed to suppress these drawbacks, and it improves performance as well [12].

NFL is a two-step algorithm: create a generalized subspace for each class, then classify using the nearest-FL rule. The RNFLS rule develops a different subspace, the nearest feature line segment subspace (NFLS-subspace), to handle extrapolation in the first step. To overcome the interpolation problem, it uses a territory for each sample point of each class. The resulting subspace, named the rectified nearest feature line segment subspace (RNFLS-subspace), is obtained by removing the FL segments that pass through the territories of other classes.

To avoid extrapolation inaccuracy, RNFLS finds the projection point on the FL; if the projection falls on an extrapolation part, the nearest endpoint is chosen as the reference point for calculating the FL distance. Otherwise, the projection falls on the interpolation part and is itself used as the reference point, as in NFL. As a result, no extrapolation segment is used and no extrapolation inaccuracy can occur, because the nearest endpoint of the FL is used instead (Figure: NFLS subspace used in RNFLS to avoid extrapolation inaccuracy).

Figure: NFLS subspace used in RNFLS to avoid extrapolation inaccuracy

Therefore, the NFLS-subspace S_w of class w consists of the line segments between all pairs of sample points of that class. It can be represented as

S_w = { xᵢxⱼ : 1 ≤ i ≤ j ≤ N_w },

where xᵢ is a sample belonging to class w, xᵢxⱼ is the line segment connecting the points xᵢ and xⱼ, and N_w is the number of samples belonging to class w.

Accordingly, for a query point q, the distance to the NFLS-subspace is calculated by

d(q, S_w) = min_{xᵢxⱼ ∈ S_w} ‖q − p‖,

where the reference point p is selected based on the position parameter μ. If 0 ≤ μ ≤ 1, the projection point lies between xᵢ and xⱼ and is itself the reference point. Otherwise, the reference point is the nearest endpoint: p = xᵢ when μ < 0 (backward extrapolation) and p = xⱼ when μ > 1 (forward extrapolation).
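The clipped position parameter used by this segment distance can be sketched as follows (a brief illustration; the function name is ours):

```python
import math

def segment_distance(q, x, y):
    """Distance from q to the segment x--y: clip the NFL position
    parameter mu into [0, 1], so extrapolation falls back to an endpoint."""
    xy = [b - a for a, b in zip(x, y)]
    mu = (sum((b - a) * d for a, b, d in zip(x, q, xy))
          / sum(d * d for d in xy))
    mu = min(1.0, max(0.0, mu))  # mu < 0 -> use x, mu > 1 -> use y
    p = [a + mu * d for a, d in zip(x, xy)]
    return math.dist(q, p)

print(segment_distance([0.5, 1.0], [0, 0], [1, 0]))  # -> 1.0 (interpolation)
print(segment_distance([3.0, 0.0], [0, 0], [1, 0]))  # -> 2.0 (clipped to y)
```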

To avoid interpolation inaccuracy, the next step is to rectify the NFLS-subspace by removing the segments that trespass on the territories of other classes. There are two types of territory: the territory of a sample point, denoted T_x, and the territory of a class, denoted T_w. Suppose the sample set is X = {x₁, x₂, …, x_N}, where each xᵢ belongs to class label(xᵢ). The radius of the sample territory of xᵢ is defined as

rᵢ = min_{label(xⱼ) ≠ label(xᵢ)} ‖xⱼ − xᵢ‖.

That is, the radius of a sample territory is the distance to the nearest sample point of any other class.

Accordingly, the sample territory is

T_{xᵢ} = { p : ‖p − xᵢ‖ < rᵢ }.

The class territory is defined as

T_w = ∪_{label(xᵢ) = w} T_{xᵢ}.

Figure: Territories of samples are shown by dotted circles and their union constitutes the class territory. The FL is removed because it trespasses on another class's territory.

In Figure: Territories of samples, each point of the "Circle" class has its own sample territory, represented by a dotted circle. The "Circle" class territory is obtained as the union of its sample territories.
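The sample-territory radii defined above can be computed directly from the definition; a brief sketch with our own toy data:

```python
import math

def territory_radii(samples, labels):
    """Radius of each sample territory: the distance to the nearest
    sample of any other class."""
    return [min(math.dist(s, t)
                for t, l in zip(samples, labels) if l != lbl)
            for s, lbl in zip(samples, labels)]

X = [[0, 0], [0, 1], [3, 0]]
y = ["a", "a", "b"]
print(territory_radii(X, y))  # radii: 3.0, ~3.162 (sqrt(10)), 3.0
```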

If U_w denotes the set of class-w line segments that trespass on the class territories of other classes, the RNFLS-subspace is defined as

S*_w = S_w − U_w,

where

U_w = { xᵢxⱼ ∈ S_w : xᵢxⱼ ∩ T_v ≠ ∅ for some class v ≠ w }.

In Figure: Territories of samples, the marked FL trespasses on the territory of the other class, so it belongs to U_w and is therefore excluded from the RNFLS-subspace.

Classification in the RNFLS-subspace is similar to that in the NFLS-subspace; the only difference is that the segments in U_w have been removed. After this step, S*_w is the set of remaining segments, and classification of a query point is done with the same distance measure.
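A segment trespasses on a sample territory when its distance to that territory's center is smaller than the territory's radius. A sketch of this rectification test, our own formulation of the check under the definitions above:

```python
import math

def segment_point_distance(x, y, c):
    """Distance from point c to segment x--y (position parameter clipped)."""
    xy = [b - a for a, b in zip(x, y)]
    mu = (sum((b - a) * d for a, b, d in zip(x, c, xy))
          / sum(d * d for d in xy))
    mu = min(1.0, max(0.0, mu))
    p = [a + mu * d for a, d in zip(x, xy)]
    return math.dist(c, p)

def trespasses(x, y, centers, radii):
    """True if segment x--y enters any of the given sample territories."""
    return any(segment_point_distance(x, y, c) < r
               for c, r in zip(centers, radii))

# One territory centered at (1, 1) with radius 0.5 (toy data).
print(trespasses([0, 0], [2, 2], [[1, 1]], [0.5]))  # -> True  (removed)
print(trespasses([0, 2], [2, 2], [[1, 1]], [0.5]))  # -> False (kept)
```

Segments for which trespasses(...) is True form U_w and are dropped from S_w.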

Figure: Classification using the RNFLS-subspace

In Figure: Classification using the RNFLS-subspace, the trespassing segments are considered deleted. The projection of the query point q on one remaining FL falls on a backward extrapolation part, so the distance is measured to the nearest endpoint; the same holds for the other FL.

Shortest feature line segment (SFLS)

SFLS [9] is another study aimed at overcoming the disadvantages of the NN method, and especially the extrapolation and interpolation inaccuracies of NFL.

Instead of calculating FL distances, SFLS finds the shortest FL segment that satisfies a geometric relation between the query point and the FL segments. It does not have any pre-processing step.

The idea behind SFLS is very simple: find the smallest hypersphere such that the query point lies inside or on it, where each hypersphere is constructed with a feature line segment as its diameter. The method tags all feature lines whose hypersphere contains the query point, and finally selects the shortest tagged feature line. Because only interpolation segments are used, the extrapolation inaccuracy that occurs in NFL cannot arise.

Figure : Geometric relation model in SFLS

In Figure: Geometric relation model in SFLS, the query point q is labeled with the "Circle" class because the smallest hypersphere containing it is created by a feature line segment of the "Circle" class.

Suppose the sample set is X = {x₁, x₂, …, x_N}, where each sample belongs to one of the classes. To label the query point q, the following steps are executed for each feature line segment xᵢxⱼ.

First, calculate the angle θ between the vectors (xᵢ − q) and (xⱼ − q), defined by

cos θ = ((xᵢ − q) · (xⱼ − q)) / (‖xᵢ − q‖ ‖xⱼ − q‖),

where ‖·‖ is the Euclidean norm.

If θ < π/2, the feature line is not tagged, because the query point is not inside or on the hypersphere (case (a) in Figure: Geometric relation between the query point and FL segment). Otherwise (θ ≥ π/2), the feature line is tagged as a candidate because the geometric criterion is satisfied (cases (b) and (c) in the same figure).

(a)

(b)

(c)

Figure : Geometric relation between the query point and FL segment.

As Figure: Geometric relation between the query point and FL segment illustrates, q is outside the hypersphere exactly when θ is an acute angle.

Second, the query point is labeled as the class of the shortest tagged feature line segment.

As illustrated in Figure: Geometric relation model in SFLS, the query point q is inside two hyperspheres, one built on a feature line of the "Circle" class and one built on a feature line of the "Star" class. Although both feature lines are tagged as candidates, the shorter one, belonging to the "Circle" class, classifies the query point.

In the cases where no tagged feature line exists, the query point is rejected, and the nearest neighbor (NN) method is applied to make the final decision.
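The whole SFLS rule, tag the segments whose hypersphere contains q, take the shortest, and fall back to NN on rejection, can be sketched as follows (an illustrative reading of [9]; names and data are ours):

```python
import math

def sfls_classify(q, samples, labels):
    """SFLS rule: tag segment x--y when the angle x-q-y is >= 90 degrees
    (i.e. q lies inside or on the hypersphere with diameter x--y), then
    return the class of the shortest tagged segment; fall back to 1-NN
    when no segment is tagged."""
    by_class = {}
    for s, lbl in zip(samples, labels):
        by_class.setdefault(lbl, []).append(s)
    best_len, best_label = math.inf, None
    for lbl, pts in by_class.items():
        for i in range(len(pts)):
            for j in range(i + 1, len(pts)):
                x, y = pts[i], pts[j]
                # (x - q).(y - q) <= 0  <=>  angle at q is >= pi/2.
                if sum((a - c) * (b - c) for a, b, c in zip(x, y, q)) <= 0:
                    seg_len = math.dist(x, y)
                    if seg_len < best_len:
                        best_len, best_label = seg_len, lbl
    if best_label is not None:
        return best_label
    # Rejected: no hypersphere contains q, decide by nearest neighbour.
    return min(zip(samples, labels), key=lambda t: math.dist(q, t[0]))[1]

X = [[0, 0], [2, 0], [0, 3], [2, 3]]
y = ["a", "a", "b", "b"]
print(sfls_classify([1, 0.2], X, y))  # -> a  (only the "a" segment is tagged)
```

Note that only segment lengths, never point-to-line distances, enter the decision, matching the description above.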

Chapter 3

3 PROPOSED ALGORITHM

Chapter 4

4 EXPERIMENTS

Chapter 5

5 CONCLUSION AND FUTURE WORK

1. Cover, T. and P. Hart, Nearest neighbor pattern classification. Information Theory, IEEE Transactions on, 1967. 13(1): p. 21-27.

2. Duda, R.O., P.E. Hart, and D.G. Stork, Pattern Classification. 2nd ed. New York: John Wiley, 2001.

3. Li, S.Z. and J. Lu, Face recognition using the nearest feature line method. Neural Networks, IEEE Transactions on, 1999. 10(2): p. 439-443.

4. Elkan, C., Nearest Neighbor Classification. 2011.

5. Cunningham, P. and S.J. Delany, k-Nearest neighbour classifiers. Multiple Classifier Systems, 2007: p. 1-17.

6. Zhou, Z., S.Z. Li, and K.L. Chan. A theoretical justification of nearest feature line method. in Pattern Recognition, 2000. Proceedings. 15th International Conference on. 2000. IEEE.

7. He, Y. Face recognition using kernel nearest feature classifiers. in Computational Intelligence and Security, 2006 International Conference on. 2006. IEEE.

8. Gao, Q.B. and Z.Z. Wang, Center-based nearest neighbor classifier. Pattern Recognition, 2007. 40(1): p. 346-349.

9. Han, D.Q., C.Z. Han, and Y. Yang, A novel classifier based on shortest feature line segment. Pattern Recognition Letters, 2011. 32(3): p. 485-493.

10. Zheng, W., L. Zhao, and C. Zou, Locally nearest neighbor classifiers for pattern classification. Pattern Recognition, 2004. 37(6): p. 1307-1309.

11. Zhou, Y.L., C.S. Zhang, and J.C. Wang, Tunable nearest neighbor classifier. Pattern Recognition, 2004. 3175: p. 204-211.

12. Du, H. and Y.Q. Chen, Rectified nearest feature line segment for pattern classification. Pattern Recognition, 2007. 40(5): p. 1486-1497.


