Physicochemical Properties Of Amino Acids


This chapter gives the theoretical background needed to understand the rest of the thesis. The first section explains proteins and amino acids, protein functions, protein structure, the physicochemical properties and the substitution matrix. The second section explains the encoding methods for protein sequences. The third section explains the machine learning techniques needed in this thesis, such as the K-means clustering algorithm and the SVM classifier. The final section covers the main techniques for dimension reduction, which include feature selection and feature extraction.

2.1 Proteins and Amino Acids

Proteins are essential elements of all organisms, including humans. The human body is about 45% protein, and these proteins play many critical roles: they do most of the work in cells and determine their health and function \cite{James2011}. This section introduces the function of proteins, the structure of proteins, the physicochemical properties and the substitution matrix.

2.1.1 Proteins Functions

Proteins perform important functions in the human body; without them the body would be unable to rebuild, adjust or protect itself \cite{Wardlaw2006}.

Each protein within the body has a specific role. Some proteins provide structural support, while others are involved in bodily movement or in defense against germs.

The main functions of proteins in the body are \cite{Wardlaw2006}:

Building and repairing all types of tissues in the body.

Serving as a source of energy.

Helping to keep skin, hair and nails healthy.

Acting as enzymes, hormones, and many immune molecules.

Supporting essential body processes, such as water balance, nutrient transport, and muscle contraction.

2.1.2 Protein Structure

Proteins vary in their structure as well as their function. The structural unit of a protein is the amino acid: each protein is constructed from a set of 20 amino acids. A protein can be represented as a string in which each character represents one amino acid, and the order of the amino acids in the sequence determines the function of the protein \cite{McKee2011}.

Protein sequences contain at least 50 amino acids; when a sequence contains fewer than 50 amino acids it is called a peptide. A peptide is a short chain of amino acid residues linked by peptide bonds, and it has the same basic structure as a protein \cite{McKee2011}.

2.1.3 Physicochemical properties of amino acids

The amino acids that form a protein determine its properties, because each amino acid has a set of physicochemical properties (PCPs); these PCPs can be used to study protein sequence profiles, folding and function \cite{Mathura2005}.

The amino acid properties can be represented by sets of numerical values, which are known as amino acid indices \cite{Tomii1996}.

Several databases of amino acid indices have been constructed and are regularly maintained; the most important ones are AAindex and APDbase \cite{Kawashima1999}\cite{Mathura2005}. AAindex contains 544 properties \cite{Kawashima1999}, whereas APDbase contains 242 properties \cite{Mathura2005}.

Table 2.1 shows an example of physicochemical properties for some amino acids, where hydrophobicity is a measure of how strongly the side chains are pushed out of water \cite{Eisenberg1982}.

Table 2.1: Example of three physicochemical properties (size, charge, hydrophobicity) for 10 amino acids; the values are taken from the AAindex database \cite{Kawashima1999}.

Property       | R    | K    | D    | Q    | N    | E    | H    | S    | T    | P
Size           | 156  | 128  | 115  | 128  | 114  | 129  | 137  | 87   | 101  | 97
Charge         | 10.8 | 9.7  | 2.8  | 5.7  | 5.4  | 3.2  | 7.6  | 5.7  | 5.9  | 6.5
Hydrophobicity | -7.5 | -4.5 | -3.0 | -2.9 | -2.7 | -2.6 | -1.7 | -1.1 | -0.8 | -0.3

2.1.4 Substitution Matrices

In this subsection, sequence alignment is explained first, followed by the substitution matrices.

Sequence alignment

Sequence alignment is the technique of comparing two (pairwise alignment) or more sequences by searching for a series of individual characters (amino acids) or patterns that occur in the same order in the sequences \cite{Rosenberg2009}.

There are two main types of alignment: local and global. In global alignment, every amino acid is aligned (possibly against a gap) and none is discarded; this is most useful when the sequences are of roughly equal size. Local alignment, on the other hand, uses the most similar segments between the sequences and ignores the rest (a part of one sequence is compared with a part of the other) \cite{Rosenberg2009}. Figure 2.1 illustrates the difference between these two types.

Figure 2.1: The difference between the global and local alignment

The result of the comparison is a score that determines the degree of similarity between the sequences based on the alignment; e.g. in Figure 2.1, if we take the score to be the number of matches between the two sequences, then the score for the global alignment is 9 and for the local alignment 8.

Substitution matrices

A substitution matrix is another technique used to find the similarity between sequences, but it relies on the physicochemical properties of amino acids to determine how readily the amino acids substitute one another \cite{Junior2003}. In a substitution matrix $S$, the entry $S_{ij}$ is used to score the alignment of character $i$ with character $j$ and should reflect the probability of the two characters substituting one another \cite{Kesmir2013}. Figure 2.2 illustrates the structure of the substitution matrix.

The first step in building a substitution matrix is to collect a large dataset of alignments of related proteins \cite{Kesmir2013}.

The substitution matrix looks as follows:

Figure 2.2: The structure of the substitution matrix

The size of a substitution matrix is 20*20. The scores are assigned based on the observed frequencies of substitutions in alignments of related proteins, and they also reflect how frequently a particular amino acid occurs in nature, as some amino acids are more abundant than others. Substitutions that are more likely get a higher score, while substitutions that are less likely get a lower score \cite{Junior2003}.
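One common way to turn observed substitution frequencies into such scores is a log-odds ratio of observed to chance-expected pair frequencies (this is the idea behind BLOSUM-style matrices). The following is a minimal, hedged Python sketch with invented toy counts for only three symbols; it is an illustration of the principle, not the exact PAM or BLOSUM construction procedure.

```python
import numpy as np

def log_odds_scores(pair_counts, background_freqs, scale=2.0):
    """Sketch of a log-odds substitution score.

    pair_counts[i, j]  : observed counts of symbol i aligned with symbol j
                         in a set of trusted alignments (symmetric matrix).
    background_freqs[i]: overall frequency of symbol i.
    Returns a rounded score matrix S with
        S[i, j] = scale * log2( q_ij / (p_i * p_j) )
    where q_ij is the observed pair frequency.
    """
    q = pair_counts / pair_counts.sum()        # observed pair frequencies
    expected = np.outer(background_freqs, background_freqs)  # expected by chance
    return np.round(scale * np.log2(q / expected))

# Toy example with 3 "amino acids" only, purely illustrative counts.
counts = np.array([[90.0, 10.0,  5.0],
                   [10.0, 80.0, 15.0],
                   [ 5.0, 15.0, 70.0]])
freqs = counts.sum(axis=1) / counts.sum()
print(log_odds_scores(counts, freqs))
```

Likely substitutions (large diagonal counts here) receive positive scores, while rare substitutions receive negative ones, matching the behaviour described above.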

There are two major families of matrices: PAM (Percent Accepted Mutation) and BLOSUM (Block Substitution Matrix).

The PAM matrices are based on global alignments of related proteins and rely on an explicit evolutionary model, whereas the BLOSUM matrices are based on local alignments of related proteins and an implicit rather than explicit model of evolution \cite{Kesmir2013}.

Two commonly used PAM matrices are PAM250 and PAM120: PAM250 is used when aligning distantly related sequences and PAM120 when aligning closely related sequences \cite{Kesmir2013}. Similarly, two commonly used BLOSUM matrices are BLOSUM-50 and BLOSUM-62: BLOSUM-50 is used when aligning distantly related sequences and BLOSUM-62 when aligning closely related sequences \cite{Kesmir2013}.

2.2 Encoding the protein sequences

In order to apply machine learning algorithms to protein sequences, the sequences must first be represented numerically. The two major encoding approaches for protein sequences are: encodings based on the amino acid sequence and encodings based on the physicochemical properties of the amino acids \cite{Nanni2011}. This section describes some of these methods.

First: Encodings based on the amino acid sequence

Different methods have been developed to encode the sequences using the amino acid characters. Some of these methods are:

Amino acid composition (AC): this simple encoding method finds the frequency of each amino acid in the protein sequence, so the feature vector contains 20 features \cite{Bhasin2004} regardless of the length of the protein chain. Figure 2.3 shows an example of AC encoding.

Figure 2.3: An example of AC encoding

Dipeptide composition: calculates the frequency of each dipeptide (a peptide chain of two consecutive amino acids) in the sequence; this method takes the order of the amino acids in the sequence into account. The feature vector contains 400 features \cite{Bhasin2004}.

Orthonormal encoding (OE): also called distributed or sparse encoding; each amino acid is represented by a 20-bit vector with 19 bits set to zero and one bit set to one, where the position of the single 1 indicates which amino acid is present. A minimal sketch of these three encodings follows.
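The following Python sketch implements the three sequence-based encodings just described (AC, dipeptide composition and orthonormal encoding) on a toy sequence; the sequence itself is an arbitrary example, not data from this thesis.

```python
from itertools import product
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"            # the 20 standard amino acids

def amino_acid_composition(seq):
    """20-dimensional vector: frequency of each amino acid in the sequence."""
    return np.array([seq.count(a) / len(seq) for a in AMINO_ACIDS])

def dipeptide_composition(seq):
    """400-dimensional vector: frequency of each ordered pair of amino acids."""
    dipeptides = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return np.array([pairs.count(d) / len(pairs) for d in dipeptides])

def orthonormal_encoding(seq):
    """(len(seq) x 20) matrix: each residue is a 20-bit vector with a single 1."""
    onehot = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, a in enumerate(seq):
        onehot[i, AMINO_ACIDS.index(a)] = 1
    return onehot

seq = "MKTAYIAKQR"                               # toy sequence, illustration only
print(amino_acid_composition(seq).shape)         # (20,)
print(dipeptide_composition(seq).shape)          # (400,)
print(orthonormal_encoding(seq).shape)           # (10, 20)
```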

Second: Encodings based on physicochemical properties

The physicochemical properties of amino acids help to determine the structure and function of a protein sequence \cite{Lapinsh2002}. Methods that encode sequences based on physicochemical properties can be simple, more complicated, or based on distributing the amino acids into groups according to their physicochemical properties.

Simple methods such as:

The simplest method is to represent each amino acid numerically by a set of physicochemical properties and to concatenate these representations (the concatenating method). For example, if the length of the sequence is N and each amino acid is represented by 5 properties, then the length of the feature vector will be N*5 \cite{Rackovsky2009}. In this method the length of the feature vector depends on the length of the protein sequence. Figure 2.4 shows an example of this encoding method.

Figure 2.4: An example of encoding method based on representing each amino acid by a set of PCPs. Three PCPs were used (Size, Charge, and hydrophobic).

The second method is called average physicochemical encoding. This encoding is invariant to the length of the sequence and is therefore well suited for proteins. Each feature is the average value of a physicochemical property over the amino acids in the sequence, so the feature vector is composed of F features, where F is the number of selected PCPs \cite{Nanni2011}. Figure 2.5 shows an example of average physicochemical encoding.

Figure 2.5: An example of average physicochemical encoding. Two PCPs were used (hydrophobicity and charge) after normalizing them to lie between 0 and 1.

Weighted physicochemical encoding was also originally developed for proteins. Instead of considering the amino acids in their natural order, they are concatenated alphabetically and weighted according to their frequency in the sequence, so the feature vector is composed of 20 * F features \cite{Nanni2011}. A sketch of these simple encodings follows.
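The sketch below implements the concatenating, average and weighted encodings described above. The property table is a randomly generated placeholder standing in for real AAindex/APDbase values, so only the vector shapes, not the numbers, are meaningful.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical table: one row per amino acid, one column per selected PCP.
# Real values would come from AAindex/APDbase; these are placeholders.
rng = np.random.default_rng(0)
PCP = rng.random((20, 3))                        # 20 amino acids x 3 properties

def concatenating_encoding(seq):
    """Length-dependent vector: N residues x F properties, flattened (N*F)."""
    return np.concatenate([PCP[AMINO_ACIDS.index(a)] for a in seq])

def average_encoding(seq):
    """Length-invariant vector of F features: mean property value over the sequence."""
    return np.mean([PCP[AMINO_ACIDS.index(a)] for a in seq], axis=0)

def weighted_encoding(seq):
    """20*F features: each amino acid's properties weighted by its frequency."""
    freqs = np.array([seq.count(a) / len(seq) for a in AMINO_ACIDS])
    return (PCP * freqs[:, None]).flatten()

seq = "MKTAYIAKQR"
print(concatenating_encoding(seq).shape)         # (30,) = 10 residues * 3 PCPs
print(average_encoding(seq).shape)               # (3,)
print(weighted_encoding(seq).shape)              # (60,) = 20 amino acids * 3 PCPs
```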

Complicated methods, such as:

Autocorrelation descriptors describe the level of correlation between two objects based on a specific structural or physicochemical property, and they are defined based on the distribution of amino acid properties along the sequence \cite{Ong2007}. Eight amino acid properties are used for deriving the autocorrelation descriptors: hydrophobicity scale, average flexibility index, polarizability parameter, free energy of amino acid solution in water, residue accessible surface area, amino acid residue volume, steric parameters, and relative mutability \cite{Ong2007}.

There exist mainly three types of autocorrelation descriptors: the Moreau-Broto, Moran and Geary autocorrelation descriptors. All PCP values of the amino acids should be normalized before applying these encodings; the normalization process is described in Equation 2.1:

\[ P'_i = \frac{P_i - \bar{P}}{\sigma} \qquad (2.1) \]

where $P'_i$ is the normalized PCP value, $P_i$ is the original PCP value, and $\bar{P}$ is the average of the PCP over the 20 amino acids, defined in Equation 2.2:

\[ \bar{P} = \frac{1}{20}\sum_{i=1}^{20} P_i \qquad (2.2) \]

The $\sigma$ is the standard deviation of the PCP, see Equation 2.3:

\[ \sigma = \sqrt{\frac{1}{20}\sum_{i=1}^{20} \left(P_i - \bar{P}\right)^2} \qquad (2.3) \]

Normalized Moreau-Broto autocorrelation descriptors \cite{Ong2007}

The normalized Moreau-Broto autocorrelation descriptors can be defined as follows:

\[ ATS(d) = \frac{1}{N-d}\sum_{i=1}^{N-d} P_i \, P_{i+d}, \qquad d = 1, 2, \ldots, nlag \]

Where:

$d$ is called the lag of the autocorrelation (e.g. lag 1 means correlating the values $P_i$ and $P_{i+1}$).

$P_i$ and $P_{i+d}$ are the (normalized) property values of the amino acids at positions $i$ and $i+d$, respectively, and $N$ is the length of the sequence.

$nlag$ is the maximum value of the lag.

Moran autocorrelation descriptors \cite{Ong2007}

The Moran autocorrelation descriptors can be defined as follows:

\[ I(d) = \frac{\dfrac{1}{N-d}\displaystyle\sum_{i=1}^{N-d} \left(P_i - \bar{P}'\right)\left(P_{i+d} - \bar{P}'\right)}{\dfrac{1}{N}\displaystyle\sum_{i=1}^{N} \left(P_i - \bar{P}'\right)^2}, \qquad d = 1, 2, \ldots, nlag \]

where $\bar{P}'$ is the average of the property values over the whole sequence.

Geary autocorrelation descriptors \cite{Ong2007}

The Geary autocorrelation descriptors can be defined as:

\[ C(d) = \frac{\dfrac{1}{2(N-d)}\displaystyle\sum_{i=1}^{N-d} \left(P_i - P_{i+d}\right)^2}{\dfrac{1}{N-1}\displaystyle\sum_{i=1}^{N} \left(P_i - \bar{P}'\right)^2}, \qquad d = 1, 2, \ldots, nlag \]

The quasi-sequence-order descriptors were proposed by Chou et al. (2000). They are derived from the distance matrix between the 20 amino acids, where the distances are computed from physicochemical properties including hydrophobicity, polarity, and side-chain volume \cite{Chou2000}.

Sequence-order-coupling Number

The d-th rank sequence-order-coupling number is defined as \cite{Chou2000}:

\[ \tau_d = \sum_{i=1}^{N-d} \left(d_{i,\,i+d}\right)^2, \qquad d = 1, 2, \ldots, maxlag \]

where $d_{i,\,i+d}$ is the distance between the two amino acids at positions $i$ and $i + d$, and $N$ is the length of the sequence.

Quasi-sequence-order Descriptors

For each amino acid type $r$, a quasi-sequence-order descriptor can be defined as \cite{Chou2000}:

\[ X_r = \frac{f_r}{\displaystyle\sum_{r=1}^{20} f_r + w \sum_{d=1}^{maxlag} \tau_d}, \qquad r = 1, 2, \ldots, 20 \]

where $f_r$ is the normalized occurrence of amino acid type $r$, and $w$ is a weighting factor (often $w = 0.1$).

These are the first 20 quasi-sequence-order descriptors. The remaining quasi-sequence-order descriptors are defined as:

\[ X_d = \frac{w\,\tau_{d-20}}{\displaystyle\sum_{r=1}^{20} f_r + w \sum_{d=1}^{maxlag} \tau_d}, \qquad d = 21, 22, \ldots, 20 + maxlag \]

Pseudo amino acid composition (PseAAC): this descriptor is similar to the quasi-sequence-order descriptor and was proposed by Chou (2001) \cite{Chou2001}. The pseudo amino acid descriptor is a $(20+\lambda)$ vector in which the first 20 components reflect the amino acid composition and the remaining components reflect the sequence order through correlation factors of different ranks. The last $\lambda$ features are obtained based on a given physicochemical property \cite{Chou2001}.

The PseAAC can be described as follows. Suppose the protein sequence has $L$ amino acid residues:

\[ P = R_1 R_2 R_3 \ldots R_L \]

The sequence order effect can be approximately reflected by a set of sequence-order-correlated factors defined as:

\[ \theta_k = \frac{1}{L-k}\sum_{i=1}^{L-k} \Theta\!\left(R_i, R_{i+k}\right), \qquad k = 1, 2, \ldots, \lambda, \quad \lambda < L \]

Here $\theta_1$ is called the first-tier correlation factor and reflects the sequence order correlation between all the most contiguous residues along the protein chain, $\theta_2$ is the second-tier correlation factor that reflects the sequence order correlation between all the second most contiguous residues, and $\theta_\lambda$ is the $\lambda$-th tier correlation factor \cite{Chou2001}.

For a single physicochemical feature, the correlation function can be defined as:

\[ \Theta\!\left(R_i, R_j\right) = \left[ H(R_j) - H(R_i) \right]^2 \]

where $H(R_i)$ is the feature (e.g. size) value of amino acid $R_i$. The value $H(R_i)$ is converted from the original feature value of the amino acid according to the following equation:

\[ H(R_i) = \frac{H^{0}(R_i) - \dfrac{1}{20}\displaystyle\sum_{r=1}^{20} H^{0}(r)}{\sigma_{H^{0}}} \]

where $H^{0}(R_i)$ is the original feature value of amino acid $R_i$ and $\sigma_{H^{0}}$ is its standard deviation over the 20 amino acids. The feature vector $x$ of the protein can then be represented by a $(20+\lambda)$ vector as follows:

\[ x_u = \begin{cases} \dfrac{f_u}{\displaystyle\sum_{r=1}^{20} f_r + w \sum_{k=1}^{\lambda} \theta_k}, & 1 \le u \le 20 \\[3ex] \dfrac{w\,\theta_{u-20}}{\displaystyle\sum_{r=1}^{20} f_r + w \sum_{k=1}^{\lambda} \theta_k}, & 20 < u \le 20 + \lambda \end{cases} \]

where the first 20 components $f_u$ represent the amino acid composition (AC) described earlier, and $w$ is a weighting factor. A sketch of this descriptor for a single property follows.
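The Python sketch below follows the single-property formulation given above: the property values for the 20 amino acids are random placeholders (real values would come from AAindex), the correlation function is the squared difference of normalized property values, and the weighting factor and number of tiers are arbitrary example settings.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical property values for the 20 amino acids (placeholders, not AAindex data).
rng = np.random.default_rng(1)
raw = rng.random(20)
H = (raw - raw.mean()) / raw.std()               # converted (normalized) property values

def theta(a, b):
    """Correlation function between two residues for a single property."""
    return (H[AMINO_ACIDS.index(a)] - H[AMINO_ACIDS.index(b)]) ** 2

def pseaac(seq, lam=3, w=0.05):
    """Pseudo amino acid composition: 20 composition terms + lam correlation terms."""
    L = len(seq)
    # Normalized occurrence frequencies of the 20 amino acids.
    f = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float) / L
    # Tier-k correlation factors theta_k, k = 1..lam (lam must be < L).
    thetas = np.array([np.mean([theta(seq[i], seq[i + k]) for i in range(L - k)])
                       for k in range(1, lam + 1)])
    denom = f.sum() + w * thetas.sum()
    return np.concatenate([f / denom, w * thetas / denom])   # (20 + lam,) vector

print(pseaac("MKTAYIAKQR", lam=3).shape)         # (23,)
```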

Encodings based on distributing amino acids into groups according to their PCPs:

The main example of this type of encoding is Composition, Transition and Distribution (CTD), a method developed by Dubchak et al. (1995).

In this method the amino acids are divided into three classes according to a given attribute, and each amino acid is encoded by one of the indices 1, 2, 3 according to the class it belongs to. The amino acids are distributed into three classes for each of 7 physicochemical attributes \cite{Dubchak1995}; see Table 2.2.

Table 2.2: Distribution of the amino acids into groups based on their PCPs

Attribute                       | Group 1                  | Group 2                                  | Group 3
Hydrophobicity                  | Polar: R,K,E,D,Q,N       | Neutral: G,A,S,T,P,H,Y                   | Hydrophobic: C,L,V,I,M,F,W
Normalized van der Waals volume | 0-2.78: G,A,S,T,P,D,C    | 2.95-4.0: N,V,E,Q,I,L                    | 4.03-8.08: M,H,K,F,R,Y,W
Polarity                        | 4.9-6.2: L,I,F,W,C,M,V,Y | 8.0-9.2: P,A,T,G,S                       | 10.4-13.0: H,Q,R,K,N,E,D
Polarizability                  | 0-0.108: G,A,S,D,T       | 0.128-0.186: C,P,N,V,E,Q,I,L             | 0.219-0.409: K,M,H,F,R,Y,W
Charge                          | Positive: K,R            | Neutral: A,N,C,Q,G,H,I,L,M,F,P,S,T,W,Y,V | Negative: D,E
Secondary structure             | Helix: E,A,L,M,Q,K,R,H   | Strand: V,I,Y,C,W,F,T                    | Coil: G,N,P,S,D
Solvent accessibility           | Buried: A,L,F,C,G,I,V,W  | Exposed: R,K,Q,E,N,D                     | Intermediate: M,S,P,T,H,Y

Each sequence is converted into a new sequence in which each amino acid is replaced by the number of the group it belongs to for a given attribute. Then three sets of values are computed for each sequence: the composition (C), transition (T) and distribution (D).

Composition: the composition can be defined as:

\[ C_r = \frac{n_r}{N}, \qquad r = 1, 2, 3 \]

where $n_r$ is the number of occurrences of group index $r$ in the encoded sequence and $N$ is the length of the sequence \cite{Nan2012}. For each sequence we obtain 21 composition values: 7 attributes, with three groups per attribute.

Transition: represents the transition from one group to another for the same attribute \cite{Nan2012}; e.g. the transition from class 1 to class 2 is the percent frequency with which 1 is followed by 2, or 2 is followed by 1, in the encoded sequence. The transition can be defined as:

\[ T_{rs} = \frac{n_{rs} + n_{sr}}{N - 1}, \qquad rs \in \{12, 13, 23\} \]

where $n_{rs}$ is the number of times group $r$ is immediately followed by group $s$ in the encoded sequence. Again, for each sequence we obtain 21 transition values (7 attributes, three transition pairs per attribute).

Distribution: the distribution descriptor describes how each attribute is distributed along the sequence. There are five distribution descriptors per group; they are the positions, expressed as percentages of the whole sequence length, of the first residue of that group and of the residues at which 25%, 50%, 75% and 100% of that group's occurrences are reached \cite{Nan2012}. For each sequence we obtain 105 distribution values: 7 attributes, three groups per attribute, and 5 values per group. A sketch of the CTD descriptors for one attribute follows.
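The sketch below computes C, T and D for the hydrophobicity attribute of Table 2.2 on a toy sequence. The quartile positions in the distribution descriptor are computed with a simple index approximation, so it should be read as an illustration of the idea rather than a reference implementation.

```python
# Hydrophobicity groups from Table 2.2: 1 = polar, 2 = neutral, 3 = hydrophobic.
GROUPS = {**{a: "1" for a in "RKEDQN"},
          **{a: "2" for a in "GASTPHY"},
          **{a: "3" for a in "CLVIMFW"}}

def encode(seq):
    """Replace each amino acid by its group index for one attribute."""
    return "".join(GROUPS[a] for a in seq)

def composition(enc):
    """C_r = n_r / N for each group r."""
    return [enc.count(r) / len(enc) for r in "123"]

def transition(enc):
    """T_rs: frequency of r followed by s or s followed by r, over N - 1 positions."""
    pairs = [enc[i:i + 2] for i in range(len(enc) - 1)]
    return [(pairs.count(r + s) + pairs.count(s + r)) / len(pairs)
            for r, s in ("12", "13", "23")]

def distribution(enc):
    """For each group: positions (as % of N) of its 1st, 25%, 50%, 75%, 100% residue."""
    out = []
    for r in "123":
        pos = [i + 1 for i, c in enumerate(enc) if c == r]
        if not pos:
            out.extend([0.0] * 5)
            continue
        idx = [0, len(pos) // 4, len(pos) // 2, 3 * len(pos) // 4, len(pos) - 1]
        out.extend(100.0 * pos[i] / len(enc) for i in idx)
    return out

enc = encode("MKTAYIAKQR")                        # toy sequence, illustration only
print(composition(enc), transition(enc), distribution(enc))
```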

2.3 Machine Learning Techniques

Machine learning is concerned with the development of algorithms and techniques that allow computers to learn; it can be defined as the science of algorithmic methods that learn from experience with the goal of improving performance on selected tasks \cite{Michalski1983}.

There are two main types of machine learning \cite{Hertzmann2010}:

Supervised Learning: such as classification

Unsupervised Learning: such as clustering

This section introduces a description of classification and clustering.

2.3.1 Cluster Analysis

Clustering is a very common technique in unsupervised machine learning used to discover groups of data that behave similarly based on the features describing the objects. The result of cluster analysis is a number of heterogeneous groups with homogeneous contents: there are substantial differences between the groups, but the individuals within a single group are similar \cite{Tan2006}. The main advantage of clustering is that it can be used to reduce the data by replacing all elements of a cluster with a single representative element.

A good clustering method produces high-quality clusters in which the intra-class similarity is high and the inter-class similarity is low; see Figure 2.6.

Figure 2.6 : Distances of intra and inter classes.

In this subsection the K-means clustering method is explained as an example of clustering.

K-means

The K-means algorithm is one of the simplest clustering algorithms that solve the well-known clustering problem. It partitions a given dataset into a certain number of clusters (assume k clusters), and for each cluster a centroid is defined \cite{Tan2006}. K-means is described by Algorithm 1.

The K-means algorithm is sensitive to the initial randomly selected cluster centers, so it should be run multiple times to reduce this effect \cite{Hertzmann2010}. It is also sensitive to the number of clusters, which can either be fixed based on prior knowledge or searched for until a suitable value is found \cite{Hertzmann2010}. A minimal sketch of the algorithm follows.
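The numpy sketch below implements the basic K-means loop (assign each point to the nearest centroid, then recompute the centroids) on toy two-dimensional data; the initialization and stopping rule are simple choices for illustration, not the exact steps of Algorithm 1.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means: pick k random points as initial centroids, then alternate
    between assigning each point to its nearest centroid and recomputing centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to the closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated blobs in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```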

2.3.2 Classification

Classification is a very common technique in supervised machine learning that generates input-output mapping relations from a set of labeled training data \cite{Michalski1983}. Figure 2.7 illustrates the concept of classification.

Figure 2.7 : Diagram of classification method

There are different tools that can be used to solve classification problems. In this subsection the support vector machine (SVM) is explained as an example of a classification tool.

Support Vector Machine (SVM)

The SVM is a supervised learning technique that generates input-output mapping relations from a set of labeled training data. It is a linear classifier that separates the data while maximizing the margin (it maximizes the distance between the separating boundary and the nearest data point of each class); the result is a hyperplane that separates the two classes. SVMs can be applied to classification and regression \cite{Gunn1998}. In this subsection the SVM for classification is described.

To use the SVM, the input data can be transformed into a high-dimensional feature space using nonlinear kernel functions in order to make the data more separable \cite{Gunn1998}. Figure 2.8 illustrates the mapping to a higher-dimensional space.

Figure 2.8 : Mapping data into a higher dimensional feature space

The SVM is a two-class classifier. The data for a two-class learning problem consist of objects labeled with one of two labels corresponding to the two classes; for convenience we assume the labels are +1 (positive examples) and -1 (negative examples) \cite{Gunn1998}.

Let $\{(\mathbf{x}_i, y_i)\}$, $i = 1, \ldots, L$, be the training points, where each input $\mathbf{x}_i$ has $D$ attributes (i.e. $\mathbf{x}_i \in \mathbb{R}^D$) and belongs to one of two classes, $y_i \in \{-1, +1\}$. In general, a linear classifier can be defined via the dot product between two vectors, as follows.

A linear classifier is based on a linear discriminant function of the form

\[ f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \]

where $\mathbf{w}$ is the weight vector and $b$ is the bias; $f$ assigns a score to each point, and the point is classified according to this score.

The hyperplane can be described by $\mathbf{w} \cdot \mathbf{x} + b = 0$. This hyperplane divides the space into two half-spaces according to the sign of $f(\mathbf{x})$, which indicates on which side of the hyperplane a point is located: if $f(\mathbf{x}) \ge 0$ one decides for the positive class, otherwise for the negative. The boundary between regions classified as positive and negative is called the decision boundary of the classifier \cite{Gunn1998}.

Linearly separable data

For linearly separable data there exist many hyperplanes that correctly classify the data points, but we should choose the optimal hyperplane, i.e. the one that maximizes the margin \cite{Bennett2000}. Figure 2.9 illustrates several possible separating hyperplanes for a set of data.

Figure 2.9 : Many possible separating hyperplanes

To find the optimal hyperplane, all points must satisfy the following constraint:

\[ y_i \left(\mathbf{w} \cdot \mathbf{x}_i + b\right) \ge 1, \qquad i = 1, \ldots, L \]

We must also find the optimal $b$ and $\mathbf{w}$ corresponding to the maximum-margin hyperplane; to do so, one solves the following optimization problem \cite{Bennett2000}:

\[ \min_{\mathbf{w},\, b} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i \left(\mathbf{w} \cdot \mathbf{x}_i + b\right) \ge 1, \; i = 1, \ldots, L \]

Minimizing $\|\mathbf{w}\|^2$ in the previous problem corresponds to maximizing the margin. This classifier, which is applicable to linearly separable data and correctly classifies all the input data with the maximum margin (see Figure 2.10), is called the hard-margin SVM \cite{Bennett2000}.

Figure 2.10 : Maximum Margin

Non-linearly separable data

In practice, data are often not linearly separable, so the SVM provides a soft-margin formulation for this type of data. It allows a wider margin by permitting the classifier to misclassify some data, i.e. by allowing errors, so the constraint on the points is changed to the following \cite{Bennett2000}:

\[ y_i \left(\mathbf{w} \cdot \mathbf{x}_i + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \]

where the $\xi_i$ are slack variables that allow data to lie inside the margin or be misclassified, and the optimization problem becomes \cite{Bennett2000}:

\[ \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{L} \xi_i \quad \text{subject to} \quad y_i \left(\mathbf{w} \cdot \mathbf{x}_i + b\right) \ge 1 - \xi_i, \; \xi_i \ge 0 \]

The constant $C > 0$ sets the relative importance of maximizing the margin and minimizing the amount of slack \cite{Bennett2000}.

To solve the previous optimization problem, the method of Lagrange multipliers is used. It reformulates the original primal problem into a dual formulation, expressed in terms of the multipliers $\alpha_i$ as \cite{Bennett2000}:

\[ \max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{L} \alpha_i - \frac{1}{2}\sum_{i=1}^{L}\sum_{j=1}^{L} \alpha_i \alpha_j y_i y_j \,\mathbf{x}_i \cdot \mathbf{x}_j \]

under the following constraints:

\[ 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{L} \alpha_i y_i = 0 \]

Then the weight vector can be expressed as

\[ \mathbf{w} = \sum_{i=1}^{L} \alpha_i y_i \mathbf{x}_i \]

The points $\mathbf{x}_i$ for which $\alpha_i > 0$ are called support vectors, see Figure 2.10.

Data that are not linearly separable should be mapped to a higher-dimensional vector space using a mapping function $\phi$; the discriminant function is then expressed as \cite{Bennett2000}:

\[ f(\mathbf{x}) = \mathbf{w} \cdot \phi(\mathbf{x}) + b \qquad (2.23) \]

In Equation 2.23, $f$ is a linear function in the mapped space because it is defined using the mapping function.

The mapping can be done implicitly using kernels; the weight vector is expressed in terms of the mapped points as follows \cite{Bennett2000}:

\[ \mathbf{w} = \sum_{i=1}^{L} \alpha_i y_i \,\phi(\mathbf{x}_i) \]

Substituting this into the discriminant function gives:

\[ f(\mathbf{x}) = \sum_{i=1}^{L} \alpha_i y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \]

where $K$ is a kernel function, defined as follows \cite{Bennett2000}:

\[ K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x}) \cdot \phi(\mathbf{x}') \]

Kernel Functions

There are different kernel functions that can be used; the main ones are \cite{Ben-Hur2008}:

Linear kernel: the simplest kernel function, computed as the inner product plus an optional constant $c$:

\[ K(\mathbf{x}, \mathbf{x}') = \mathbf{x} \cdot \mathbf{x}' + c \]

Polynomial kernel: suitable for problems where all the training data are normalized:

\[ K(\mathbf{x}, \mathbf{x}') = \left(a\,\mathbf{x} \cdot \mathbf{x}' + c\right)^{d} \]

where $a$ is the slope, an adjustable parameter, $c$ is a constant, and $d$ is the degree of the polynomial.

Gaussian kernel: an example of a radial basis function (RBF) kernel:

\[ K(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right) \]

where $\sigma$ is a parameter that controls the width of the Gaussian; it plays a similar role to the degree of the polynomial kernel. A short sketch of a soft-margin SVM with these kernels follows.
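The scikit-learn sketch below trains soft-margin SVMs with the three kernels listed above on toy, non-linearly-separable data; the data, the value of C, and the other parameter settings are arbitrary examples chosen only to illustrate the role of the kernel and of the slack trade-off.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Toy two-class data: the negative class surrounds the positive one, so the
# classes are not linearly separable in the original space.
rng = np.random.default_rng(0)
inner = rng.normal(0, 1.0, (100, 2))
outer = rng.normal(0, 1.0, (100, 2))
outer = outer / np.linalg.norm(outer, axis=1, keepdims=True) * 4.0
X = np.vstack([inner, outer])
y = np.array([1] * 100 + [-1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# C sets the trade-off between a wide margin and misclassifying training points.
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, C=1.0, degree=3, gamma="scale").fit(X_tr, y_tr)
    print(kernel, clf.score(X_te, y_te))
```

On this data the linear kernel performs near chance level while the polynomial and Gaussian (RBF) kernels separate the classes well, illustrating why mapping to a higher-dimensional feature space helps.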

2.4 Dimension reduction techniques

The dimension of the data is the number of variables (features) measured for each observation (instance) \cite{Fodor2002}. When the data objects used by machine learning techniques are described by a large number of features (i.e. the data are high-dimensional), it is often beneficial to reduce the dimension of the data \cite{Conn2007}. Dimensionality reduction is the transformation of high-dimensional data into a meaningful representation of reduced dimensionality \cite{Maaten2007}.

Dimensionality reduction is an important task in machine learning for different reasons \cite{ Alpaydın2010}:

It facilitates classification, compression, and visualization of high-dimensional data.

When an input is unnecessary (e.g. redundant), we save the cost of extracting it.

It reduces both the time and space complexity.

There are two main methods for reducing dimensionality: feature selection and feature extraction.

2.4.1 Feature selection

In feature selection, we select the subset of the D dimensions that gives us the most information and discard the other dimensions (the unimportant features). There are two approaches to feature selection: forward and backward selection \cite{Alpaydın2010}.

In forward selection, we start with an empty set and add features one by one, at each step adding the one that decreases the error the most, until any further addition does not decrease the error. In backward selection, we start with all the features and remove them one by one, at each step removing the one that decreases the error the most (or increases it only slightly), until any further removal increases the error significantly \cite{Alpaydın2010}. A sketch of forward selection follows.
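The sketch below wraps greedy forward selection around a classifier, using cross-validated accuracy as the criterion (so "decreasing the error" corresponds to increasing the accuracy). The choice of logistic regression, the Iris dataset and 5-fold cross-validation are arbitrary example settings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, model):
    """Greedy forward selection: start empty, repeatedly add the feature that
    most improves cross-validated accuracy, stop when no feature helps."""
    remaining, selected = list(range(X.shape[1])), []
    best_score = 0.0
    while remaining:
        scores = [(cross_val_score(model, X[:, selected + [f]], y, cv=5).mean(), f)
                  for f in remaining]
        score, best_f = max(scores)
        if score <= best_score:          # no further addition decreases the error
            break
        best_score = score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected, best_score

X, y = load_iris(return_X_y=True)
print(forward_selection(X, y, LogisticRegression(max_iter=1000)))
```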

2.4.2 Feature extraction

In feature extraction, we find a new set of k dimensions that are extracted from the original D dimensions. These methods may be supervised or unsupervised depending on whether or not they use the output information \cite{ Alpaydın2010}.

In this subsection three methods of feature extraction are discussed, these methods are: Principal Components Analysis (PCA), Factor Analysis (FA) and Multidimensional Scaling (MDS).

Principal Component Analysis (PCA)

PCA is a dimension reduction technique that is widely used due to its simplicity and efficiency \cite{Burges2009}.

Let $X = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ be sample data described by a set of $p$ features, represented as a feature-object matrix in which each column is an object. The covariance matrix of these data is defined as \cite{Burges2009}:

\[ \Sigma = \frac{1}{n}\sum_{i=1}^{n} \left(\mathbf{x}_i - \boldsymbol{\mu}\right)\left(\mathbf{x}_i - \boldsymbol{\mu}\right)^{T}, \qquad \boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i \]

The diagonal terms of $\Sigma$ capture the variance of the individual features, and the off-diagonal terms quantify the covariance between the corresponding pairs of features \cite{Burges2009}.

For this covariance matrix we can compute the eigenvectors and eigenvalues, so that there are $p$ eigenvalues and eigenvectors. The dimension reduction then starts by selecting the eigenvectors with the highest eigenvalues (the principal components) of the dataset \cite{Alpaydın2010}. The data are then represented by a set of $k$ new features, where $k < p$; the reduced data can be expressed as:

\[ Y = W^{T}\left(X - \boldsymbol{\mu}\right) \]

where $W$ is the $p \times k$ matrix whose columns are the $k$ selected eigenvectors. A numpy sketch follows.
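The sketch below follows the steps just described (center the data, compute the covariance matrix, eigendecompose it, project onto the top-k eigenvectors). It uses the more common rows-as-observations layout, and the data are toy correlated features generated for illustration.

```python
import numpy as np

def pca(X, k):
    """Project an n x p data matrix X (rows = observations) onto its top-k
    principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)            # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]             # indices of the k largest
    W = eigvecs[:, order]                             # p x k projection matrix
    return X_centered @ W                             # n x k reduced data

# Toy data: 100 observations of 5 correlated features, reduced to 2 dimensions.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(100, 3))])
print(pca(X, k=2).shape)                              # (100, 2)
```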

Factor Analysis (FA)

In FA, if there is a group of variables that have high correlation among themselves and low correlation with all the other variables, then there may be a single underlying factor that gave rise to these variables \cite{Alpaydın2010}.

FA partitions the features into factor clusters, so that a few factors can represent these groups of features. In FA the original features can be obtained from the factors, whereas in PCA they cannot \cite{Alpaydın2010}.

Multidimensional scaling (MDS)

Sometimes we know the distances between pairs of points but not the exact coordinates of the points, their dimensionality, or how the distances were calculated \cite{Alpaydın2010}. Multidimensional scaling (MDS) is a method for placing these points in a low-dimensional (e.g. two-dimensional) space such that the Euclidean distances between them are as close as possible to the given distances in the original space \cite{Alpaydın2010}. A sketch follows.
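The scikit-learn sketch below illustrates this setting: points are generated in five dimensions, only their pairwise distance matrix is kept, and MDS is asked to place points in two dimensions whose Euclidean distances approximate the given ones. The data are toy values for illustration only.

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist, squareform

# Toy setup: generate points in 5-D, keep only their pairwise distances,
# then ask MDS to place points in 2-D so Euclidean distances match them.
rng = np.random.default_rng(0)
original = rng.normal(size=(30, 5))
D = squareform(pdist(original))                     # 30 x 30 distance matrix

embedding = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = embedding.fit_transform(D)
print(coords.shape)                                  # (30, 2)
```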


