Transcription Factors and Transcription Factor Binding Sites

Published Date: 02 Nov 2017

* Introduction

* Transcription Factors and Transcription Factor Binding Sites (TF and TFBS)

* Types Of Motif Finding Algorithms

* Word Based Approach (or Exhaustive Enumeration)

* Probabilistic Approach

* Additional Algorithms and Methods

* Leading programs for motif discovery

- MEME

- MotifLab

- MotifSampler

- Weeder

- AlignACE

- NestedMICA

* Programs Challenges

* Strengths of NestedMICA compared to other probabilistic motif finder programs

* NestedMICA Vs. MEME

* Conclusions

Introduction

Now that the genomic sequences of a growing number of plants, fungi, animals and humans are complete and available, identifying the interesting elements in DNA, such as genes and their regulatory elements, is a great challenge. Identifying the cis-regulatory elements that control gene expression, along with the transcription factors (TFs) that bind to them, is hugely important, both because gene interactions are central to how cells function and because they can help explain the origins of organismal complexity and development (Debraj G.T., Stormo G.D., 2007). These cis-regulatory elements can consist of Transcription Factor Binding Motifs (TFBMs), promoters, enhancers, silencers and other sequence motifs that regulate genes (Maston, G.A., Evans, S.K., Green, M.R., 2006). To identify (putative) cis-regulatory elements, one can search for known sequence motifs (e.g., TFBMs or promoters) within a specified region upstream of the start codon and in the first intron. Several databases (for example, TRANSFAC, RegTransBase and JASPAR) contain the published TFBMs and promoters (Wertheim, B., 2012).

DNA methylation can alter gene expression and plays a major role in stabilizing the genomes of plants, fungi, animals and humans. It also plays an important role in specifying cell lineages in these organisms (He, G., 2011). DNA methylation of CpG-rich regions, commonly called CpG islands, leads to silencing of gene expression, and this modification is inherited by daughter cells (Law, J.A., 2010).

DNA methylation can occur at TF binding sites and control gene expression in several ways. Many diseases and other phenotypic defects are caused by a mutation in a regulatory sequence rather than in the gene itself.

Transcription factor binding motifs (TFBMs) form the promoters and other regulatory elements, and identifying them is a very important step toward understanding gene regulation. Even though the means for cross-species comparison exist, identifying and modeling the regulatory elements and the binding sites for the regulatory proteins (TFs) is still a great challenge. This is because TF-binding sites (TFBSs) are very short, spanning 6 to 25 nucleotides, and TFs can tolerate significant degeneracy in their target sites (Debraj G.T. & Stormo G.D., 2007).

Because TFBMs (Transcription Factor Binding Motifs) are difficult to identify experimentally, only a relatively small number have been discovered, especially compared with the hundreds of transcription factor genes identified. Over the past decade, extensive research has been conducted in this area and a large number of motif-finding tools have been developed, each with different strengths. However, most of these tools perform poorly, owing to insufficient knowledge of the structure, length and representation of TFBMs in the genome. Motif-finding software uses two major strategies: exhaustive enumeration and probabilistic models. Exhaustive enumeration is a good strategy for finding perfectly conserved motifs, but for typical transcription factor binding sites, which often have several weakly constrained positions, it becomes problematic and the results usually have to be post-processed with a clustering system. NestedMICA is a sensitive, scalable pattern-discovery system that uses a probabilistic model. In this review, several motif-finding tools are presented, discussed and compared to NestedMICA (Down, T.A.).

This review compares the main tools available for identifying and detecting DNA sequence motifs that represent the transcriptional regulatory elements of genes.

Transcription Factors and Transcription Factor Binding Sites (TF and TFBS)

Transcription is the process in which DNA is copied to form a new messenger RNA (mRNA), which is responsible for the synthesis of proteins or for other cellular processes such as RNA interference. A transcription factor (TF) is a protein that binds to a gene at specific sites and regulates its expression. A TF can either activate gene expression by promoting the recruitment of other transcription-related proteins, or repress and block expression of the gene. This mechanism enables transcription factors to regulate gene expression at the molecular level.

The regions in the non-coding parts of the DNA that contain the binding sites for the regulatory proteins governing the spatiotemporal expression of genes are the Transcription Factor Binding Sites (TFBSs). Because these binding sites, or regulatory motifs, are so short and can also show sequence variation, they are hard to identify. Understanding transcriptional regulation is important in many areas of molecular biology, which has driven researchers to develop various strategies for predicting the presence of TFBSs. Motif-finding software is usually used to detect transcription factor binding sites in promoter regions, but motif discovery methods can be, and are, also used to find other functional elements in biological sequences, both nucleic acid and protein (Thomas A. Down and Tim J. P. Hubbard, 2005).

Types Of Motif Finding Algorithms

Most motif-finding algorithms fall into two major categories based on the combinatorial approach used: word-based (string-based) methods, which rely on exhaustive enumeration and represent motifs as regular expressions (RE), and probabilistic sequence models based on position weight matrices (PWM) (Bailey, T.L., 2008). Both approaches have advantages and disadvantages.

Word Based Approach (or Exhaustive Enumeration)

The word-based method uses regular expressions to exhaustively count and compare oligonucleotide frequencies (van Helden, J., B. Andre, and J. Collado-Vides, 1998). Regular expressions can be quite flexible and concise in matching DNA sequence patterns (or any other text strings). The strength of the word-based method is that, because the search is exhaustive, it is guaranteed to find the global optimum. Its disadvantage is that exhaustive search is feasible only for short motifs. The method works well for motifs in which every position is perfectly conserved, but for typical transcription factor motifs, which often have several weakly constrained positions, it may not be good enough (Vilo, J., et al., 2000).
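The exhaustive enumeration idea can be illustrated with a minimal Python sketch. It assumes that a word is scored simply by its raw count in the input sequences; real word-based tools compare observed frequencies against a background expectation, which is not reproduced here. The function names and the toy promoter sequences are hypothetical.

```python
from collections import Counter
from itertools import product


def count_kmers(sequences, k):
    """Count every overlapping length-k word across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enumerate_words(sequences, k, alphabet="ACGT"):
    """Score all len(alphabet)**k possible words and rank them by count.

    Because every word is visited, the top-ranked word is the global
    optimum for this (assumed) raw-count score.
    """
    counts = count_kmers(sequences, k)
    all_words = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(((w, counts[w]) for w in all_words),
                  key=lambda wc: (-wc[1], wc[0]))


# Toy input: three short sequences sharing one perfectly conserved 8-mer.
promoters = ["TTGACGTCAAT", "CCTGACGTCAG", "ATGACGTCATT"]
ranked = enumerate_words(promoters, 8)
print(ranked[0])
```

The number of candidate words grows as 4^k for DNA, which is why this strategy is practical only for short motifs, and exact word matching is why it copes poorly with weakly constrained positions.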

Probabilistic Approach

The probabilistic approach represents the motif with a position weight matrix (PWM) (Bucher, P., 1990). A PWM gives the probability of each letter appearing at each position of the motif as an n by m matrix, where n is the number of letters in the alphabet (four for DNA) and m is the number of positions in the motif. The entry in row i and column j of the matrix is the probability of letter i occurring at position j in the motif, written as:

$$P_{ij}, \qquad 1 \le i \le n, \quad 1 \le j \le m$$

This model relies on the assumption that each position in the motif is independent of the others. Therefore, the probability of a sequence is the product of the corresponding entries in the PWM. The strength of the probabilistic approach compared with word-based methods is that each letter can match a particular motif position to a certain degree, rather than simply matching or not matching. Many of the algorithms developed using the probabilistic approach are designed to find more general motifs.
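The product rule under the independence assumption can be made concrete with a short Python sketch. The 4 by 3 matrix below is a made-up example, not a published motif, and the helper names are hypothetical.

```python
# Hypothetical PWM: one row per letter (n = 4), one column per motif
# position (m = 3). Each column sums to 1.
PWM = {
    "A": [0.7, 0.1, 0.1],
    "C": [0.1, 0.1, 0.1],
    "G": [0.1, 0.7, 0.1],
    "T": [0.1, 0.1, 0.7],
}


def site_probability(pwm, site):
    """Probability of a site: the product of the PWM entries P[letter][j]."""
    prob = 1.0
    for j, letter in enumerate(site):
        prob *= pwm[letter][j]
    return prob


def best_window(pwm, sequence):
    """Return (probability, start) of the best-scoring window in a sequence."""
    m = len(next(iter(pwm.values())))
    windows = [(site_probability(pwm, sequence[i:i + m]), i)
               for i in range(len(sequence) - m + 1)]
    return max(windows)


print(site_probability(PWM, "AGT"))  # strong match to every position
print(site_probability(PWM, "AGC"))  # one weak position lowers the score
print(best_window(PWM, "CCAGTCC"))
```

Unlike a word match, a site with one mismatched position still receives a nonzero score, which is the graded matching described above.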

Additional Algorithms and Methods

Other motif discovery algorithms exist, for example algorithms that combine word-based methods and probabilistic approaches (the MDscan algorithm). TAMO can run several motif discovery algorithms (MEME, AlignACE and MDscan) and output their combined results (Gordon, D.B., et al, 2005). Other approaches are based on further machine learning techniques, neural networks and clustering algorithms. Algorithms based on promoter sequences of co-regulated genes and on phylogenetic footprinting are also available through user-friendly interfaces, either on web servers or in toolboxes. For example, the Toolbox of Motif Discovery (Tmod), which is not covered further here, combines 12 widely used motif discovery programs: MDscan, BioProspector, AlignACE, Gibbs Motif Sampler, MEME, CONSENSUS, MotifRegressor, GLAM, MotifSampler, SeSiMCMC, Weeder and YMF (Sun, H., et al., 2010). More recently, novel experimental techniques such as chromatin immunoprecipitation (ChIP) have been introduced that allow the genome-wide identification of protein-DNA interactions. Chromatin immunoprecipitation applied to transcription factors, together with genome tiling arrays (ChIP on Chip) or next-generation sequencing technologies (ChIP-Seq), has opened fascinating new research possibilities, and has also introduced new challenges for bioinformaticians developing algorithms and methods for motif discovery (Federico Zambelli, Graziano Pesole, Giulio Pavesi, 2012).

Leading programs for motif discovery

Nowadays, there are a few leading programs for TFBS motif discovery:

1) MEME (Multiple EM for Motif Elicitation):

MEME is one of the most popular tools for searching for motifs in biological sequences. Its operating principle is optimizing the E-value of a statistic related to the information content of the motif. MEME searches for repeated, ungapped sequence patterns in the DNA or protein sequences that the user provides. The input is a set of sequences that are believed to share some unknown sequence signal. MEME then searches for significant motifs in the input sequences. In this way, MEME can discover the binding sites of a shared transcription factor in a set of promoters, or protein-protein binding sequences in a set of proteins.

Figure 1 below shows the output of MEME software for discovering DNA sequence motifs.

Figure 1: MEME: discovering and analyzing DNA and protein sequence motifs, Timothy L. Bailey, Nadya Williams, Chris Misleh, Wilfred W. Li, 2006

Sample MEME output. This screenshot of a MEME HTML output form shows a protein motif that MEME has discovered in the input sequences. MEME indicates the sites it identified as belonging to the motif; above them are the 'consensus' of the motif and a color-coded bar graph showing the conservation of each position in the motif. Some of the hyperlinked buttons that allow the motif to be viewed and analyzed in other ways can be seen at the bottom of the screenshot.
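Per-position conservation of this kind is commonly quantified as information content. The following Python sketch is not MEME's actual statistic, which is not spelled out here; it assumes a uniform background over the alphabet and uses the standard definition (maximum entropy minus the column's entropy, in bits). The function name and the example columns are hypothetical.

```python
import math


def column_information(column, alphabet_size=4):
    """Information content (bits) of one motif column.

    `column` holds the letter probabilities at one position. A uniform
    background is assumed, so the maximum is log2(alphabet_size):
    2 bits for DNA, about 4.32 bits for a 20-letter protein alphabet.
    """
    entropy = -sum(p * math.log2(p) for p in column if p > 0)
    return math.log2(alphabet_size) - entropy


perfectly_conserved = [1.0, 0.0, 0.0, 0.0]
unconstrained = [0.25, 0.25, 0.25, 0.25]
two_letters = [0.5, 0.5, 0.0, 0.0]

for name, col in [("conserved", perfectly_conserved),
                  ("unconstrained", unconstrained),
                  ("two letters", two_letters)]:
    print(name, column_information(col))
```

A perfectly conserved DNA position carries the full 2 bits and an unconstrained one carries none, so summing the column values gives one overall measure of how sharply a motif is defined.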

2) MotifLab:

MotifLab is a workspace for integrating tools and data for motif discovery and the analysis of regulatory sequences. It is a general workbench for analyzing regulatory sequence regions and discovering transcription factor binding sites and regulatory domains in the genome. MotifLab lets users integrate several popular motif discovery tools with additional kinds of information, including phylogenetic conservation, epigenetic marks, DNase hypersensitive sites, ChIP-Seq data, positional binding preferences of transcription factors, transcription factor interactions and gene expression. In this way it supports extensive motif discovery and analysis. MotifLab can create, manipulate and analyse data objects through the data-processing operations it offers. Among its primary functions are discovering motifs and searching for transcription factor binding sites within sequences. However, MotifLab does not perform motif discovery itself; for such tasks it relies on external programs installed on the user's computer. This makes MotifLab very flexible with respect to local software preferences and novel tools, but it is also a limitation. The supported motif discovery tools include AlignACE, BioProspector, MDscan, MEME, MotifSampler and Weeder.

3) MotifSampler:

The MotifSampler algorithm uses Gibbs sampling to find the position probability matrix of the motif. This implementation uses higher-order background models to improve motif finding. MotifSampler comes with background models for several organisms. MotifSampler is a stochastic method, so it gives different results when run several times with the same parameters. This may seem a disadvantage, but it becomes an advantage when the tool is run multiple times and the results are post-processed, since motifs that recur across runs are more trustworthy. This type of analysis is better done with the stand-alone version and should not be run through the website.
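To show what Gibbs sampling for motif positions looks like in practice, here is a deliberately simplified Python sketch. It is not MotifSampler's implementation: it assumes exactly one site per sequence and a uniform zero-order background instead of the higher-order background models described above. All function names, parameters and input sequences are hypothetical.

```python
import random

ALPHABET = "ACGT"


def build_profile(sites, pseudocount=0.5):
    """Column-wise letter probabilities from aligned sites, with pseudocounts."""
    profile = []
    for j in range(len(sites[0])):
        counts = {a: pseudocount for a in ALPHABET}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        profile.append({a: c / total for a, c in counts.items()})
    return profile


def window_weight(profile, window, background=0.25):
    """Likelihood ratio of a window: motif profile vs. uniform background."""
    weight = 1.0
    for j, letter in enumerate(window):
        weight *= profile[j][letter] / background
    return weight


def gibbs_sample(sequences, width, iterations=200, seed=0):
    """Return one sampled site start per sequence (needs >= 2 sequences)."""
    rng = random.Random(seed)
    # Start from a random site position in every sequence.
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        for i, seq in enumerate(sequences):
            # Hold out sequence i; rebuild the profile from the other sites.
            others = [s[p:p + width]
                      for k, (s, p) in enumerate(zip(sequences, positions))
                      if k != i]
            profile = build_profile(others)
            # Resample the site in sequence i in proportion to its weight.
            starts = range(len(seq) - width + 1)
            weights = [window_weight(profile, seq[p:p + width]) for p in starts]
            positions[i] = rng.choices(starts, weights=weights)[0]
    return positions


sequences = ["ACGTTGACGTCAAC", "TTGACGTCAGGCTA",
             "GGCATTGACGTCAT", "CATGACGTCATGCA"]
positions = gibbs_sample(sequences, width=8, iterations=50, seed=1)
print([seq[p:p + 8] for seq, p in zip(sequences, positions)])
```

Changing `seed` mimics repeated runs of a stochastic sampler: individual runs may settle on different sites, so comparing the outputs of several runs is one way to judge which motifs are stable.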

4) Weeder:

Weeder adopts parameters and statistical evaluation designed for the identification of conserved TFBS.

Weeder does not rely on any prior knowledge in the discovery process, which makes the discovery of novel motifs possible. It tries to statistically distinguish the similarities among the input sequences from the similarities found in a randomly built set of sequences (Pavesi et al., 2006). Weeder's working principle is that a set of k-mers similar to one another should be detectable in a group of nonhomologous sequences that contain a motif. Such a set of k-mers could be instances of binding sites of the same transcription factor, and should not be found when analyzing a random set of sequences that share no motif (Pavesi et al., 2006). Weeder represents motifs by a consensus, that is, a sequence built from the most frequent nucleotide at each position of the sites. Weeder considers all k-mers that differ from the consensus by no more than a threshold number of substitutions to be valid instances of the motif. On this basis, the k-mers of the input sequences are grouped, and each group is evaluated with a measure of significance. In this step, a species-specific background model built from the oligonucleotide distribution of all promoter regions from multiple species is applied. Finally, the raw results are analyzed and processed to obtain the motifs most likely to represent conserved TFBSs (Pavesi et al., 2006).
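The consensus representation and the substitution threshold described above can be sketched in a few lines of Python. This is an illustration of those two ideas only, not Weeder's search algorithm or its significance measure; the function names and example sites are hypothetical.

```python
from collections import Counter


def consensus(sites):
    """Most frequent nucleotide at each position of equally long sites."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*sites))


def hamming(a, b):
    """Number of substitutions between two equally long words."""
    return sum(x != y for x, y in zip(a, b))


def find_instances(sequence, motif, max_substitutions):
    """All (start, k-mer) pairs within the substitution threshold of motif."""
    k = len(motif)
    return [(i, sequence[i:i + k])
            for i in range(len(sequence) - k + 1)
            if hamming(sequence[i:i + k], motif) <= max_substitutions]


sites = ["TGACGTCA", "TGACGTAA", "TGATGTCA"]
motif = consensus(sites)
print(motif)
print(find_instances("CCTGATGTCAGG", motif, max_substitutions=1))
print(find_instances("CCTGATGTCAGG", motif, max_substitutions=0))
```

With a threshold of one substitution the degenerate site is accepted as an instance, while an exact-match search misses it.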

5) AlignACE:

AlignACE is an algorithm, implemented in the C programming language, for finding multiple motifs in a given set of DNA input sequences. AlignACE is based on a Gibbs sampling algorithm previously used to find protein sequence motifs, with the following improvements. First, the motif model was changed so that the non-site sequence frequencies fit the source genome. Second, both strands of the input sequences are considered, and overlapping sites are never allowed. Third, simultaneous multiple motif searching was replaced by a step-by-step search approach (Martin Tompa et al., 2005).

6) NestedMICA:

NestedMICA is a new motif discovery system for finding TFBSs and similar motifs in biological sequences. NestedMICA optimizes a probabilistic mixture model that explains the input data as a mixture of functional motifs and background sequence (Thomas A. Down and Tim J. P. Hubbard, 2005). Because NestedMICA uses a powerful, recently developed inference technique called nested sampling, it can find optimal solutions without the rather problematic initialization or seeding step, and combined with a novel mosaic background model it achieves extremely high sensitivity. Beyond its improved sensitivity, NestedMICA has several features that make it especially suited to discovering multiple motifs in large datasets, including whole-genome promoter/enhancer sets (Down TA, Bergman CM, Su J and Hubbard TJ, 2007).

Programs Challenges

Algorithms developed using the probabilistic approach rely on some form of local search (such as Gibbs sampling, expectation maximization or greedy algorithms), which may well fail to converge to the globally optimal solution; unlike word-based methods, they offer no guarantee of finding it. Another concern is that real regulatory regions, and many other contexts where interesting motifs can be found, usually contain more than one distinct functional motif. Many regulatory regions also contain several instances of the same motif (Thomas A. Down and Tim J. P. Hubbard, 2005). NestedMICA is one program that manages to avoid these issues, which is its main strength. First, NestedMICA uses a sequence model based on the independent component analysis (ICA) framework to learn models for multiple motifs at the same time. Second, it uses an alternative inference strategy with a greater likelihood of finding a globally optimal model in a single run. NestedMICA has also been implemented so that arbitrary background models can be plugged in, allowing the investigation of more sophisticated backgrounds (Thomas A. Down and Tim J. P. Hubbard, 2005).

Strengths of NestedMICA compared to other probabilistic motif finder programs

NestedMICA infers multiple motifs simultaneously. This is a marked departure from many other probabilistic motif finders, which take a stepwise approach: detect one motif, mask its occurrences, then find the next one, and so on. In this respect, NestedMICA is closer to methods that perform parallel inference of multiple motifs. Simultaneous motif discovery yields much greater sensitivity and also contributes to the good scalability of the method, since the program does not need to restart the analysis for each additional motif (Thomas A Down et al., 2007).

While other probabilistic motif finders use traditional Monte Carlo methods such as Gibbs sampling to parameterize their probabilistic models, NestedMICA uses a strategy called nested sampling, a distinct and more recent Monte Carlo method. This strategy was chosen because it proved effective at finding better solutions to motif-inference problems without requiring heuristic methods to choose good starting states. Another important property is that nested sampling provides an estimate of the evidence term of a Bayesian computation. Bayesian evidence has always been difficult to calculate, but it can be used for model comparison in a way that correctly penalizes the extra parameters in highly complex models (Thomas A Down et al., 2007). While the primary aim in developing NestedMICA was a highly sensitive tool that is also statistically rigorous and accurate, there was also a strong motivation to maximize performance. NestedMICA can run on large sets of sequence data, which can reach up to several thousands of bases, and although runtimes on large datasets can be quite long, they are manageable if the program is run in distributed mode, spreading the workload across several machines connected by a fast network (Thomas A. Down and Tim J. P. Hubbard, 2005).

NestedMICA Vs. MEME

It is difficult to evaluate the relative performance of motif-finding software on real data, since there are very few large collections of sequences in which we can be certain that every functional binding site has been annotated with high accuracy. Therefore, one evaluation method was to generate synthetic sequences containing a known number of known sequence motifs. To make the synthetic data as realistic as possible, they were based on sequence fragments taken from intergenic regions of the human genome, into which experimentally derived human transcription factor binding sites from the JASPAR collection were inserted. A number of human motifs from JASPAR were investigated, representing binding sites from a range of major transcription factor families. Each set of sequences was analyzed using NestedMICA and MEME version 3.0.4, as shown in figures 2 and 3 for the HLF motif. Both methods were run with default options. Both programs tended to fail abruptly: below a certain threshold sequence length (which depends on the method) the recovered motif was always very similar to the target, while above the threshold a dramatically different motif was found. This abrupt failure makes it possible to quantify the performance of a method on a particular motif by identifying the longest set of sequences from which the motif can be successfully recovered (Thomas A. Down and Tim J. P. Hubbard, 2005). Figure 2 shows a motif from the JASPAR database and the results obtained with MEME and NestedMICA.
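The construction of such synthetic benchmark sequences can be sketched as follows. This Python example is a simplification under stated assumptions: the backgrounds are uniform random DNA rather than human intergenic fragments, the motif is a made-up 4-column matrix rather than a JASPAR entry, and each site overwrites part of the background so that sequence length is preserved. All names are hypothetical.

```python
import random


def sample_site(pwm_columns, rng):
    """Draw one site from a PWM given as a list of {letter: prob} columns."""
    return "".join(
        rng.choices(list(col), weights=list(col.values()))[0]
        for col in pwm_columns
    )


def plant_motif(backgrounds, pwm_columns, seed=0):
    """Plant one sampled site per background sequence.

    Returns (sequence, start, site) tuples so that the true site
    positions are known when a motif finder is evaluated.
    """
    rng = random.Random(seed)
    planted = []
    for bg in backgrounds:
        site = sample_site(pwm_columns, rng)
        start = rng.randrange(len(bg) - len(site) + 1)
        seq = bg[:start] + site + bg[start + len(site):]
        planted.append((seq, start, site))
    return planted


# Assumed stand-ins for real data: random backgrounds and a toy motif.
bg_rng = random.Random(1)
backgrounds = ["".join(bg_rng.choice("ACGT") for _ in range(50))
               for _ in range(5)]
toy_motif = [
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
for seq, start, site in plant_motif(backgrounds, toy_motif, seed=2):
    print(start, site, seq)
```

Repeating this with progressively longer backgrounds gives the kind of test series described above, in which the longest sequence length at which the planted motif is still recovered serves as the performance measure.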

Figure 2: Thomas A Down and Tim J P Hubbard, NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence, 2005.

(a) The original HLF motif from JASPAR. (b) Results for searching for HLF in a set of 150 base sequences using MEME. (c) MEME with 200 base sequences. (d) NestedMICA with 600 base sequences. (e) NestedMICA with 700 base sequences.

Results for the selection of JASPAR motifs:

Figure 3 shows that, owing to its sensitivity, NestedMICA discovers the HLF motif in cases where MEME does not.


Figure 3: Thomas A Down and Tim J P Hubbard, NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence, 2005.

*Discovery of the HLF motif from sets of 100 synthetic sequences of various lengths.

NestedMICA proved significantly more sensitive in most cases. The size of the difference in sensitivity depends on the motif in question. In the case of HLF, as shown above, NestedMICA successfully retrieves the expected motif from sequences four times as long as the longest handled by MEME (Thomas A. Down and Tim J. P. Hubbard, 2005).

Conclusions

NestedMICA outperforms existing methods in tasks such as discovering known regulatory motifs from the JASPAR database (Thomas A Down; Tim J P Hubbard, 2005). In cases such as a dataset containing two different known motifs, NestedMICA responds robustly and still finds the expected element as well as the decoy (Thomas A Down; Tim J P Hubbard, 2005). An important advantage of NestedMICA is its use of multi-class (mosaic) background models, rather than the single-class Markov chains used by most probabilistic programs (Thomas A Down; Tim J P Hubbard, 2005).

When tested on finding short protein signals, NestedMICA did not tend to report high-information-content motifs when the dataset contained no meaningful motif (Mutlu Doğruel, Thomas A Down, and Tim JP Hubbard, 2008).

NestedMICA has proved to be a powerful tool for finding single and multiple motifs, even at low motif abundance and across different motif lengths, making it a robust and sensitive protein motif finder (Mutlu Doğruel, Thomas A Down, and Tim JP Hubbard, 2008).
