PSO in Gene Clustering

Published Date: 02 Nov 2017

Microarray techniques offer new insights into the biology of a cell by enabling researchers to measure the activity of many thousands of genes simultaneously. These measurements help in understanding gene function and gene regulatory networks, and they support drug discovery. However, because of the large number of genes, interpreting such a mass of data is a major challenge. The first step toward addressing this challenge is the use of clustering techniques, which identify interesting patterns in the underlying data. Cluster analysis partitions a given data set into groups based on specified features so that the data within a group are more similar to each other than to the data in other groups. Gene clustering (Yeung, 2001) is thus defined as the process of assigning genes to clusters based on similarities in their activity patterns (co-expressed genes). Genes with similar activity patterns should be grouped together, while genes with different activity patterns should be placed in distinct clusters, because genes with similar activity patterns are often functionally related and controlled by the same regulatory mechanism (co-regulated genes). A number of standard clustering algorithms, such as hierarchical clustering, K-means clustering, the self-organizing map (SOM) and the genetic algorithm (GA), have been used to cluster gene expression data.
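To make "similarity in activity patterns" concrete, the sketch below computes a correlation-based distance between expression profiles. The text does not fix a similarity measure, so the use of Pearson correlation, the function name and the expression values are all assumptions made for illustration.

```python
import math

def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)

# Hypothetical expression levels of three genes across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same pattern as gene_a, different scale
gene_c = [4.0, 3.0, 2.0, 1.0]   # opposite pattern to gene_a
```

Under this measure gene_a and gene_b are at distance 0 and would be candidates for the same cluster, while gene_a and gene_c are at the maximum distance of 2.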

Xiao et al. (2003) proposed a hybrid SOM/PSO algorithm in which SOM clusters the data set in the first stage and PSO refines the clustering in the second stage by optimizing the SOM weights. Another hybrid algorithm, combining PSO with a support vector machine (SVM), was proposed by Alba et al. (2007) to classify genes from cancer data sets. That work also evaluated, for the first time, a new version of PSO called geometric PSO, which uses a binary representation in Hamming space. The authors reported a 100% classification rate.

In another work, Li et al. (2008) combined PSO with GA and SVM. Here a PSO/GA hybrid selects the most important gene subsets, which are then used to train the SVM classifier. The experimental results on three data sets show an improvement in cluster formation and therefore in classification accuracy.

Zhihua et al. (2008) modified K-means by incorporating a particle-pair optimizer and named the result the PK-means clustering algorithm. This hybridization improves performance and convergence rate, and the experiments show that PK-means outperforms K-means. Yarking Lam et al. (2011) enhanced the performance of PK-means by introducing the concept of cluster matching, which is a two-step process. In the first step, each cluster contained in the particle's position is matched with a cluster contained in the global best position on the basis of nearest distance. In the second step, the sequence of clusters in the current particle position is rearranged according to the matching results. The authors reported that the proposed PSO-KM performs better than PK-means and K-means in terms of compactness.
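The two-step cluster matching described above can be sketched as follows. This is only an illustration of the idea, not the authors' implementation: the greedy one-to-one matching order, the Euclidean distance and the name match_clusters are assumptions.

```python
import math

def match_clusters(particle, gbest):
    """Reorder `particle` centroids so that index i lines up with gbest[i].

    Step 1: match each global-best centroid to the nearest unused particle centroid.
    Step 2: rearrange the particle's centroids according to that matching.
    The greedy order is an assumption of this sketch.
    """
    unused = list(range(len(particle)))
    reordered = []
    for target in gbest:
        nearest = min(unused, key=lambda j: math.dist(particle[j], target))
        unused.remove(nearest)
        reordered.append(particle[nearest])
    return reordered

# Hypothetical 2D centroids: the particle holds the same clusters in another order.
gbest = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
particle = [(9.5, 0.5), (0.2, -0.1), (5.1, 4.8)]
aligned = match_clusters(particle, gbest)
```

After matching, each dimension of the particle refers to the same cluster as the corresponding dimension of the global best, so the velocity update compares like with like.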

The Memetic K-means algorithm MKMA (2011) uses a Comprehensive Learning Particle Swarm Optimizer (CLPSO) based Memetic Algorithm (MA) to minimize the sum of squared distances by combining global search and local search. In each iteration, CLPSO partitions the particle swarm into a leader group and a populace group based on fitness value. The authors conducted experiments on two gene expression data sets and reported that MKMA consistently attained better performance than K-means, fuzzy K-means and PK-means.

Lam et al. (2012) proposed another algorithm, XK-means, which uses the concept of an exploratory vector along with the hybridization of PSO and K-means. The exploratory vector is added to each centroid before a K-means iteration, which increases the level of exploration. The results reported by the authors indicate that the proposed method is faster than the K-means and PK-means algorithms and gives the best results in terms of cluster compactness and stability.
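The idea of perturbing the centroids before a K-means iteration can be sketched as below. How XK-means actually generates its exploratory vector is not described here, so the uniform random perturbation, the scale parameter and the function names are assumptions of this sketch.

```python
import random

def kmeans_step(points, centroids):
    """One K-means iteration: assign each point to its nearest centroid, then recompute."""
    dims = len(centroids[0])
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda k: sum((p[d] - centroids[k][d]) ** 2 for d in range(dims)),
        )
        groups[nearest].append(p)
    new_centroids = []
    for k, members in enumerate(groups):
        if members:
            new_centroids.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(dims))
            )
        else:
            new_centroids.append(centroids[k])  # an empty cluster keeps its centroid
    return new_centroids

def exploratory_step(points, centroids, scale, rng):
    """Add a random exploratory vector to each centroid, then run one K-means step."""
    shifted = [
        tuple(c + rng.uniform(-scale, scale) for c in centroid)
        for centroid in centroids
    ]
    return kmeans_step(points, shifted)

# Hypothetical data: two well-separated groups of 2D points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
rng = random.Random(0)
centroids = exploratory_step(points, [(1.0, 1.0), (9.0, 9.0)], scale=0.5, rng=rng)
```

With well-separated groups the perturbation does not change the outcome, but on harder data it lets the centroids leave a poor local arrangement before K-means pulls them back toward the data.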

Sun et al. (2012) proposed the Quantum-Behaved Particle Swarm Optimization (QPSO) algorithm for gene clustering. In this work, a multi-elitist strategy, giving Multi-Elitist QPSO (MEQPSO), is used to update the gbest position of the QPSO algorithm. As a result, MEQPSO has a stronger global search ability and better overall performance than the original PSO.

PSO in Phylogenetic Tree Construction

Phylogenetics (Rizzo J and Rouchka Eric C, 2007) is the study of the evolutionary histories of living organisms. It represents evolutionary divergences by finite directed (weighted) graphs or directed (weighted) trees, known as phylogenies. Based on molecular sequences, phylogenetic trees can be built to reconstruct the evolutionary tree of the species involved. In particular, the representation derived from gene or protein sequences is known as a gene phylogeny, while the representation of the evolutionary path of the species is often referred to as a species phylogeny. A gene phylogeny is, to some extent, a local description. It describes only the evolution of a particular gene or encoded protein, and this sequence may evolve faster or slower than other genes in the genome, or it may have a completely different evolutionary history from the rest of the genome. While the topology of a phylogenetic tree represents the relationships between the taxa, assigning scales to the edges can provide extra information on the amount of evolutionary divergence as well as the time of the divergence. Two main types of tree are found: (a) rooted trees, which have a single node from which all other nodes are derived, and (b) unrooted trees, which do not originate from one clear node. The tree follows standard graph theory notation: each species is represented as a node or leaf, and the relationship between species is represented as an edge or branch. The lengths of the branches represent the estimated time between the species.

Figure 2: (a) Unrooted tree, (b) Rooted tree.

There are many phylogenetic methods available today, each with strengths and weaknesses. Most can be classified as either distance-based methods or character-based methods.

Distance-based methods estimate pairwise distances (dissimilarities) before computing a branch-weighted phylogenetic tree. If the pairwise distances are sufficiently close to the number of evolutionary events between pairs of taxa, these methods reconstruct a correct tree (Kim and Warnow 1999). This assumption holds for many models of biomolecular sequence evolution, in which case distance-based methods give sufficiently accurate results (Li 1997). The main advantage of distance-based methods is their low time complexity, which makes them applicable to the analysis of large data sets. The most commonly used methods are UPGMA and Neighbor-joining. UPGMA (Unweighted Pair-Group Method using Arithmetic averages; Rohlf 1963) was originally proposed for taxonomic purposes but can be used for inferring phylogenies under the assumption that the rate of nucleotide or amino acid substitution is the same for all evolutionary lineages. In contrast to UPGMA, Neighbor-joining (NJ) (Saitou and Nei 1987; Studier and Keppler 1988) is designed to correct for unequal rates of evolution in different branches of the tree. NJ has a low time complexity and, like other distance methods, performs well when the divergence between sequences is low. Computationally, tree generation by NJ is similar to UPGMA: when two nodes are linked, their common ancestral node is added to the reduced matrix and the terminal nodes with their respective branches are removed from it. Unlike UPGMA, neighbor-joining does not produce a dendrogram (ultrametric distance) but an additive tree (additive distance).
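The merge-and-reduce procedure described above can be illustrated with a compact UPGMA sketch. The nested-tuple tree representation, the function name and the example distances are assumptions made for illustration, and no attempt is made at efficiency.

```python
def upgma(names, dist):
    """Build a UPGMA tree from a symmetric distance matrix.

    Returns a nested tuple (left, right, height); leaves are plain names.
    """
    # Each active cluster: id -> (subtree, number of leaves it contains).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller id, larger id).
    d = {
        (i, j): dist[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Link the two closest clusters.
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the new ancestral node to every remaining cluster is
        # the size-weighted average of the two old distances.
        new_d = {}
        for c in clusters:
            d_ac = d[tuple(sorted((a, c)))]
            d_bc = d[tuple(sorted((b, c)))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Remove the two linked nodes from the reduced matrix, add the ancestor.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

# Hypothetical distances: A and B are close, C is equally far from both.
names = ["A", "B", "C"]
dist = [
    [0.0, 2.0, 6.0],
    [2.0, 0.0, 6.0],
    [6.0, 6.0, 0.0],
]
tree = upgma(names, dist)
```

Because each internal node is placed at half the distance between the clusters it joins, every leaf ends up at the same distance from the root, which is the ultrametric property mentioned above.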

Character-based methods use the aligned characters, such as DNA or protein sequences, directly during tree inference. They include Maximum Parsimony and Maximum Likelihood methods.

Maximum Parsimony infers phylogenetic trees by evaluating the possible mutations between sequences. In general terms, the aim of parsimony methods is to find the phylogenetic tree with minimum total length, that is, the tree with the smallest number of evolutionary changes that explains the observed data. There are several variations of parsimony. The two simplest and most widely used are the Fitch (Fitch 1971) and Wagner (Farris 1970) parsimonies. Fitch parsimony uses no constraints at all and allows any state to transform directly into any other state, whereas Wagner parsimony places a minimum of constraints on permissible character-state changes and assumes that any transformation from one character state to another implies a transformation through all intervening states, as defined by the ordering relationship. The Wagner method assumes that characters are measured on an interval scale; it is therefore appropriate for binary, ordered multistate and continuous characters. The Fitch method allows unordered multistate characters (e.g. in nucleotide or protein sequences). Both methods permit free reversibility: a change of character state in either direction is assumed to be equally probable, and character states may transform from one state to another and back again. A consequence of reversibility is that a tree may be rooted at any point with no change in tree length.
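For a single unordered character, the tree length used by Fitch parsimony can be computed with the well-known bottom-up set rule, sketched below. The tuple-based tree encoding, the function name and the example states are assumptions made for illustration; a full analysis would sum this score over all aligned sites and search over tree topologies.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    `tree` is a leaf name or a (left, right) tuple; `states` maps leaf -> state.
    Returns (candidate state set at this node, number of changes below it).
    """
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left_set, left_cost = fitch_score(tree[0], states)
    right_set, right_cost = fitch_score(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: one change is required on this node's branches.
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical nucleotide states at one aligned site for four taxa.
states = {"a": "A", "b": "A", "c": "G", "d": "G"}
tree_1 = (("a", "b"), ("c", "d"))   # groups identical states together
tree_2 = (("a", "c"), ("b", "d"))   # splits them apart
```

Here tree_1 explains the site with one change and tree_2 needs two, so parsimony prefers tree_1 for this character.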

The maximum likelihood approach for inferring phylogenies from sequence data was introduced by Felsenstein (1981). It assigns quantitative probabilities to mutational events rather than merely counting them. The method compares possible phylogenetic trees on the basis of their ability to predict the observed data, and the tree that has the highest probability of producing the observed sequences is preferred. Like maximum parsimony, maximum likelihood reconstructs ancestors at all nodes of each considered tree, but it also assigns branch lengths based on the probabilities of mutations. For each possible tree topology, the assumed substitution rates are varied to find the parameters that give the highest likelihood of producing the observed sequences. The main obstacle to the widespread use of maximum likelihood is computational time. Algorithms that find the maximum likelihood score must search through a multidimensional space of parameters, which makes the solution of large-scale problems (>100 sequences) extremely time consuming. Maximum likelihood estimation may also be subject to systematic errors. This happens if the model of evolution used to evaluate the likelihood of the given trees does not reflect the actual evolutionary processes.
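The smallest instance of this idea is a "tree" with two sequences and one branch. The sketch below evaluates the likelihood of a branch length under the Jukes-Cantor substitution model and compares it with the known closed-form maximum. The choice of the Jukes-Cantor model, the function names and the site counts are assumptions made for illustration; the text does not prescribe a particular model.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diffs):
    """Log-likelihood of branch length d (substitutions per site) under Jukes-Cantor."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay        # a site shows the same base in both sequences
    p_each_diff = 0.25 - 0.25 * decay   # a site shows one specific different base
    # Each site also carries the stationary base frequency of 1/4.
    return ((n_sites - n_diffs) * math.log(0.25 * p_same)
            + n_diffs * math.log(0.25 * p_each_diff))

def jc69_ml_distance(n_sites, n_diffs):
    """Closed-form maximum likelihood branch length under Jukes-Cantor."""
    p = n_diffs / n_sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical alignment: 100 sites, 10 of which differ.
d_hat = jc69_ml_distance(n_sites=100, n_diffs=10)
```

The estimate is slightly larger than the observed 10% difference because the model accounts for multiple substitutions at the same site. On a real tree the same likelihood must be maximized jointly over all branch lengths and topologies, which is the source of the computational cost noted above.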

Lv et al. (2004) proposed a novel algorithm for phylogenetic tree reconstruction in which Discrete Particle Swarm Optimization (DPSO) is used to select the best tree from the population. In the proposed algorithm, the fitness value of each particle in the population is calculated first, and the individual with the maximum fitness value is then used for phylogenetic tree construction. Once the tree is constructed, population updating and branch adjustment are performed. In population updating, the position and velocity are updated using the DPSO position and velocity update equations. In the next step, a comparison is made to adjust the branches of the tree: if the distance between two nodes is greater than or equal to 2D (where D is the distance between two sequences), the branch is separated; otherwise the branches are combined. These updates continue until the phylogenetic tree is optimized. The DPSO algorithm gives optimized results even if the initial population changes. The algorithm was applied to a 25-sequence problem involving sequences of the chloroplast gene rbcL from a diversity of green plants, and the experimental results are satisfactory when compared with other traditional algorithms.

PSO in Energy Minimization of a Molecule

Molecular modeling can be considered an application of computerized techniques to analyze molecules and predict molecular, chemical and biochemical properties. The functions of molecular modeling include structure retrieval and generation, visualization, superposition, alignment, calculation of molecular properties, dynamic simulation, conformational search, and energy calculation and minimization. Knowing the most stable conformation of a molecule is important because it allows us to understand its properties and behavior based on its structure. A molecule as initially built does not necessarily correspond to one of the stable conformers. It has been found that the lowest energy structure is related to the global minimum of the molecular potential energy function, so energy minimization is usually carried out to determine a stable conformer. Molecular energy minimization is one of the most challenging unsolved problems in molecular biophysics, and nowadays many researchers from computer science and optimization pay close attention to this problem.

Many algorithms have been proposed for solving this problem, including simulated annealing, GA, and branch and bound. Rong et al. (2009) proposed FM_PSO, which combines the filter method with particle swarm optimization to improve the ability of exploration and exploitation. The authors also incorporate a divide-and-conquer method to speed up convergence. In the proposed method, the whole swarm is initially divided into 2k sub-swarms by the divide-and-conquer method, and the best individual in each sub-swarm is updated according to the filter technique. As the generations proceed, pairs of sub-swarms merge into one until all sub-swarms have merged into a single swarm covering the whole space. The feasibility and effectiveness of the proposed FM_PSO were tested on 8 benchmark functions, and the method was then applied to predict the structure of a macromolecule. The results show that FM_PSO handles the high-dimensional case better than the branch and bound method and the SPSO algorithm.
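The velocity and position updates that such methods build on can be sketched as follows. This is only the standard global-best PSO with an inertia weight, not FM_PSO itself: the filter method and the sub-swarm merging are omitted, and the function names, the parameter values (w, c1, c2) and the quadratic toy energy standing in for a molecular potential are assumptions chosen for illustration.

```python
import random

def pso_minimize(f, dim, bounds, n_particles=20, n_iters=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize f over a box with a basic global-best PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move, keeping the particle inside the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def toy_energy(x):
    """Stand-in for a potential energy surface: a quadratic bowl with its minimum at 1."""
    return sum((xi - 1.0) ** 2 for xi in x)

best_x, best_e = pso_minimize(toy_energy, dim=3, bounds=(-5.0, 5.0))
```

A real molecular energy function has many local minima and far more dimensions than this toy surface, which is why the literature adds mechanisms such as sub-swarms on top of the basic update.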

PSO in Protein Modelling

A protein is an organic heteropolymer in which amino acids are linked together by peptide bonds. The primary structure of a protein folds into a three-dimensional configuration in order to perform its function. This folded functional state of the protein is called the native state. Functional characterization of proteins is one of the most frequent tasks in biology and can be accomplished by determining the tertiary structure of the protein of interest. The tertiary structure is determined by either X-ray crystallography or NMR. According to the data collected, more than 46 million individual sequences have been produced from 51 billion known nucleotide bases, but only 35,701 of them have had their 3D structures solved experimentally using X-ray and NMR, because these techniques are time consuming and complicated. In addition, many proteins are too large for NMR and cannot be crystallized. Computational approaches, also termed protein modelling, therefore act as a substitute. Several computational methods can be used to fill the gap between sequence space and structure space. These approaches fall into two broad classes: comparative modeling and de novo (ab initio) modeling.

Comparative protein modeling uses previously solved structures as starting points, or templates, together with a scoring function that assesses the compatibility of the sequence with the structure, to yield a possible 3D model. These methods can be split into two groups: homology modeling and protein threading. Homology modeling predicts the 3D structure of a target protein from its amino acid sequence using a homologous protein for which an X-ray or NMR structure is available. It is the most widely used and reliable theoretical method for predicting protein structure from sequence. Threading, or fold recognition, is the method by which a library of unique or representative structures is searched for structural analogies to the target sequence. It is based on the theory that there may be only a limited number of distinct protein folds.

De novo protein modeling methods seek to build 3D protein models from scratch. These methods assume that the native structure corresponds to the global free energy minimum accessible during the lifespan of the protein, and they attempt to find this minimum by exploring many conceivable protein conformations. The two key components of de novo methods are the procedure for carrying out the conformational search efficiently and the free energy function used for evaluating possible conformations. Two basic kinds of model have been developed: detailed models and Hydrophobic-Polar models. Detailed models consider the interactions between all atoms of the protein sequence. The search space is therefore huge, with an overwhelming number of possible degrees of freedom and interactions between the different atoms. The energy function is usually based on molecular mechanics and force field components such as bond lengths, bond angles, dihedral angles, van der Waals interactions and electrostatic forces. Hydrophobic-Polar (HP) models represent each amino acid, with all of its atoms, as one bead labeled either hydrophobic (H) or polar (P). In this model, beads lie on the points of a lattice, placed by some chosen algorithm such that the most stable structure is the one with the hydrophobic amino acids lying in its core. The underlying concept is that hydrophobic amino acids tend to avoid contact with the solvent and hence move inside the structure, whereas the polar ones remain on the outside. The main energy function used in this model is based on the total number of hydrophobic interactions between the amino acids, and the goal is to find a lattice conformation with minimum energy, i.e. with the maximum number of H-H contacts. HP models can be 2-dimensional (2D) or 3-dimensional (3D). The problem of predicting protein structures is computationally intractable. Hence, heuristic and metaheuristic algorithms have been reported for finding good sub-optimal solutions. Among them, the applications of PSO to the protein structure problem by various researchers are discussed in detail below.
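The HP energy function described above can be sketched for the 2D square lattice as follows. The convention of scoring -1 per H-H contact between residues that are lattice neighbours but not chain neighbours is the usual one; the function name, the coordinate-list encoding and the example conformations are assumptions made for illustration.

```python
def hp_energy(sequence, coords):
    """Energy of a 2D HP lattice conformation: -1 per non-bonded H-H contact.

    `sequence` is a string of 'H' and 'P'; `coords` gives one (x, y) lattice
    point per residue, in chain order.
    """
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    index_at = {xy: i for i, xy in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy

# Hypothetical four-residue chain in two conformations.
sequence = "HPPH"
folded = [(0, 0), (1, 0), (1, 1), (0, 1)]     # the two H residues touch
extended = [(0, 0), (1, 0), (2, 0), (3, 0)]   # no H-H contact
```

The folded conformation has energy -1 and the extended one has energy 0, so a search algorithm minimizing this function is driven toward conformations that bury hydrophobic residues next to each other.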

Liu et al. (2005) applied the PSO algorithm to search for the ground state of the toy model, which is the simplest model used to represent protein structure. Experiments were conducted on both artificial data and real protein data, and PSO was found to be effective in searching for the ground state of the toy model.

In 2007, Call et al. introduced PSO for the first time to perform global optimization in minimum-structure searches for chemical systems. The authors made several modifications to the original PSO. First, it uses two types of velocity, one for each unit's center of mass and the other for each unit's angle, each with its own Vmax. Second, the best solution seen by a particle is sometimes not updated with a newly discovered best solution from the current iteration. Further novel features include a flexible initial population containing fragmented, randomly generated linear and planar structures, enforcement of user-defined minimum and maximum distance constraints between atoms, and a distance metric for measuring the similarity of structures. Simulation results on three chemical structures demonstrate that PSO finds global minimum structures effectively. PSO requires a small population size and converges quickly compared with simulated annealing (SA) and the genetic algorithm (GA).

Meissner et al. (2007) introduced constriction-type PSO (CPSO) as an optimization technique for protein structure prediction. This work uses a coarse-grained "beads on a string" backbone model in which every particle in the swarm represents a distinct backbone conformation. Root mean square deviation (RMSD) is used as the fitness function, and another scoring function is also applied to measure fold similarity efficiently. Simulation results show that PSO is capable of optimizing backbone geometries and generates good solutions in refolding studies, yielding near-native structures for two small sample proteins.

Zhang and Li et al. (2007) proposed a toy-model-based PSO for the protein folding problem. The proposed architecture consists of three parts: an elitist part, an exploitative part and an explorative part. By incorporating local search and global search, the authors showed that the proposed algorithm is effective in searching for the native state of proteins with the lowest free energy.

In the work of Datta et al. (2008), an artificial neural network (ANN) is hybridized with particle swarm optimization to predict the tertiary structure of proteins using an ab initio approach for global minimization of the energy function. A three-layered ANN trained with the back-propagation algorithm is used to predict the side-chain dihedral angles, while PSO is applied to optimize the CHARMM energy function, which is used to find the main-chain dihedral angles. The authors show that this algorithm outperforms all the other classical techniques in 80% of cases and also reduces the dimensionality of the search space.

Lin et al. (2009) proposed an efficient hybrid Taguchi genetic algorithm for solving the protein folding problem in the 2D HP model. This algorithm combines the global exploration capability of the genetic algorithm with the strong exploitation capability of the Taguchi method, and PSO is used to improve the mutation mechanism. The proposed algorithm was tested on 2D benchmark HP proteins and shows superior performance in comparison with the genetic algorithm, the ant colony algorithm, Monte Carlo, and tabu search with genetic algorithm.

In the work of Kanj et al. (2009), PSO is applied to the protein structure prediction problem in the 3D HP model. The proposed algorithm starts with a small population of candidate solutions and then gradually explores the search space to find structures with minimum energy. The algorithm was tested on two sets of benchmark sequences of different lengths, and the results show that it outperforms the existing algorithms.

Bauto et al. (2010) extended binary PSO to predict the tertiary structure of proteins in lattice models. They introduced a new discrete PSO and a Roulette PSO, which uses the roulette wheel structure of the GA. Simulation results on six proteins with three lattice models and two folding encodings indicate that the new algorithm performs efficiently and is able to find conformations of minimum energy.

In order to reduce the computation time of the protein folding problem, Hernandez et al. (2010) implemented PSO in a distributed computing environment and named it parallel PSO. While predicting the protein 3D structure of minimum energy, parallel PSO considers the structural restrictions of the protein, with the conformation represented by the torsion angles of the backbone and the side chains. Energy is calculated using the empirical energy function ECEPP/3, and the results show that the proposed algorithm is comparable to existing algorithms.

In order to enhance the performance of protein structure prediction in the 3D HP model, Cheng Jian Lin et al. (2011) proposed a hybrid genetic-algorithm-based PSO (HGA-PSO). In the first stage the genetic algorithm is applied, and PSO is used as a mutation operator. This encourages the particles to move towards their own best positions. The simulation results reported by the authors show that HGA-PSO performs better than existing evolutionary algorithms.

In PSO-SQP for protein folding (Wang et al., 2011), the training process is divided into two phases. In the first phase the particles are trained with standard PSO, and in the second phase sequential quadratic programming (SQP) is used to fine-tune the local search. The SQP method divides the problem into a sequence of sub-problems, each of which optimizes a quadratic model of the objective subject to a linearization of the constraints. The experimental results on four chains of different lengths show that PSO-SQP outperforms both GA and PSO.

In the algorithm of Chen et al. (2011), Levy flight is combined with PSO to solve the protein folding problem using the 3D AB off-lattice model. Levy flight is a local search method in which the step length is chosen from a probability distribution with a power-law tail, based on chaos theory. For the experiments, Fibonacci sequences of hydrophobic and hydrophilic amino acids are first generated randomly. The algorithm is run for up to 2000 iterations with a population size of 100. Fifty independent runs of the proposed algorithm on four real protein sequences from the PDB database show that it outperforms other existing algorithms.

Mansour et al. (2012) introduced PSO with a repair algorithm to solve the protein structure prediction (PSP) problem in the 3D HP model. The proposed algorithm starts with a random population of particles, and each particle is evaluated using the energy function of the 3D HP lattice structure. At every iteration the swarm is updated using the PSO velocity update equation with a certain probability (Rate). If two or more amino acids lie at the same point on the cubic 3D lattice, a collision occurs and the particle is termed invalid. An invalid particle is repaired by a repair algorithm, which searches locally for an alternative empty location for the amino acid that causes the collision. If none is available, it tries to find a previous amino acid whose location can be modified. If more than three amino acids have been searched, or if none can be modified, the particle is assumed to be irreparable and the initial input particle is returned. The new particle replaces the old particle if its energy is lower than or equal to the energy of the old particle. The experimental results show that the proposed PSO performs better when tested on proteins of 27 and 64 amino acids in length.

Kondov (2013) applied PSO to predict protein structure based on an all-atom force field. Four variants of PSO, namely the classical and linear update schemes, each with and without periodic boundary conditions (PBCs), were investigated for a series of peptides with 28 to 64 optimization dimensions. The author reported that the classical update scheme yields faster and more accurate structure prediction, and that the inclusion of PBCs improves the accuracy and efficiency of both update schemes for all peptides. The paper also investigated the performance of synchronous and asynchronous parallel PSO and found that asynchronous parallel PSO is better for any number of workers.
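The validity check that precedes a repair step in 3D lattice models can be sketched as follows. This only shows how a collision is detected; the repair procedure itself is not reproduced. The absolute-move encoding (R, L, F, B, U, D), the function names and the example move strings are assumptions made for illustration.

```python
# Unit steps on the cubic lattice for each absolute move (assumed encoding).
MOVES = {
    "R": (1, 0, 0), "L": (-1, 0, 0),
    "F": (0, 1, 0), "B": (0, -1, 0),
    "U": (0, 0, 1), "D": (0, 0, -1),
}

def decode(moves):
    """Turn a string of absolute moves into 3D lattice coordinates."""
    coords = [(0, 0, 0)]
    for m in moves:
        dx, dy, dz = MOVES[m]
        x, y, z = coords[-1]
        coords.append((x + dx, y + dy, z + dz))
    return coords

def first_collision(coords):
    """Return the index of the first residue placed on an occupied point, or None."""
    seen = set()
    for i, point in enumerate(coords):
        if point in seen:
            return i
        seen.add(point)
    return None

# Hypothetical particles: the first walks back onto its own starting point.
invalid = decode("RFLB")
valid = decode("RFU")
```

A particle for which first_collision returns an index is invalid, and that index identifies the amino acid for which a repair step would need to find an alternative empty lattice point.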

Conclusion:

Bioinformatics is the application of computer technology to the management and analysis of biological data. The field is data driven and aims at uncovering the knowledge hidden in the mass of data so as to obtain deep insight into the fundamental biology of organisms. As biological data is growing exponentially, there is a need for rapid surveys of the published literature that allow researchers to conduct informed work, avoid repetition and generate new hypotheses. Since the introduction of PSO to this field, many variants of PSO have been designed and applied to many problems in Bioinformatics, yet such applications remain few. The aim of this paper is therefore to present a review of the application of Particle Swarm Optimization in Bioinformatics and to inspire research and further development of new applications and new concepts that exploit PSO.

The outcome of this review demonstrates the need to improve the existing tools that are already being applied to Bioinformatics problems. There are also some issues with PSO in relation to Bioinformatics. Firstly, the basic velocity updating scheme in PSO is common to all applications, so problem-specific operators need to be designed. Secondly, PSO is parameter dependent, and extensive experimentation is required to identify appropriate ranges of parameter values for different Bioinformatics tasks, which is tedious and time consuming. Lastly, PSO and its variants involve a large degree of randomness, and different runs of the same program may yield different results. It is therefore necessary to incorporate problem-specific domain knowledge into swarm intelligence (SI) tools to reduce randomness and computational time, and current research should progress in this direction as well.
