Computational Modeling Of Small Molecule Inhibitors Of Epigenetic

Published Date: 02 Nov 2017

Abstract

Background

Cell to cell variation in protein levels describes a complex interplay between gene regulatory networks and expression at various levels of organisation. In mammalian cells, the regulation of genes is orchestrated by DNA methylation and histone modifications predominantly the interactions within chromatin environment forming epigenetic mechanisms. This interplay impacts a variety of phenomena, such as pathogenicity, disease, adaptation to changing environments and differential cell-fate outcome . Epigenetics has become a central issue in biological studies of development and disease. Inhibitors for various epigenetic modifiers are potential drug targets for several diseases including cancer.

Results

In the present study, we have used four different high throughput screens available for the inhibitors of epigenetic modifiers. Computational predictive models were constructed based on the molecular descriptors generation owing to the activity of molecules. We used two machine learning algorithms, Naive Bayes and Random forest, and developed highly accurate predictive classification models. Significant substructures with potential biological activity were also identified using substructure search approach .

Conclusion

Machine learning algorithms for supervised training, Naive Bayes and Random Forest, were used to generate predictive models for the small molecule inhibitors of histone methyltransferases and demethylases. Random forest, with the accuracy of 80%, was identified as the most accurate classifier. Further we complemented the study with substructure search approach filtering out the probable pharmacophores from the active molecules leading to drug molecules. To the best of our knowledge, this is the first and most comprehensive computational analysis towards understanding activities of small molecules inhibitors of epigenetic modifiers.

Introduction

Though all cells in an organism inherits the same genomic template, the dynamic expression of the genome provides for the cell-type and tissue specific organisation and functional organisation of multi-cellular organisms [1]. This dynamic regulation is largely dependent on the regulatory layer of interactions between multiple biomolecules, operating at the chromatin organisation, transcriptional and post-transcriptional layers [2]. The regulatory layer contributed by the Epigenetic layer has been one of the favourite areas of interest recently [3]. The epigenetic layer of regulation comprises largely of DNA modifications, histone modifications and noncoding RNA regulation and the interplay between each of these major components. The understanding of this epigenetic layer of gene regulation has largely been fueled by large-scale genome-wide maps of both DNA modifications and histone modifications [4] , thanks to the availability of high-throughput sequencing based assays to qualify epigenetic marks across the genome. Epigenetic modifications and dysregulation of epigenetic modification has been implicated in the pathophysiology of a wide spectrum of diseases [3].Though the present understanding of the role of epigenetic dysregulation contributing to the pathophysiology of diseases is just being unravelled, a number of diseases including cancers [5] neuropsychiatric disorders [6], metabolic disorders [7]have been shown to have a strong association with epigenetic dysregulation.

Histone organization and post-transcriptional modification of histones contribute a major and well studied class of epigenetic marks. Histone proteins are integral components of the nucleosome and post-transcriptional modification of histones and their interplay with DNA base modifications largely regulate the transcription of genes. These post-translational modification of histones are modulated by proteins popularly known as histone modifiers, which dynamically regulate the pattern of modifications across the genome in a concerted, but poorly understood mechanism. Ample evidence in the recent years have pointed to how DNA methylation and histone modifications could modulate gene expression [8], mark gene boundaries [9] and potentially differentiate between protein-coding and noncoding gene promoters in the genome [10,11]. Histone modifiers or Epigenetic modifiers are largely categorised into three groups [12]. The first group of proteins largely post-translationally â€˜writeâ€™ marks on the histone tail. Well studied examples of such proteins include histone Methyltransferases or acetyltransferases. The second group of proteins largely â€˜eraseâ€™ existing marks on the histone, and include well characterized proteins like demethylases and deacetylases. The third and potentially poorly understood class of proteins recognise specific epigenetic marks and bind to the histone complex modulating their regulatory effect.

Epigenetic modifiers have been recently studied in detail as they could provide for attractive drug targets for diseases where epigenetic dysregulation plays an important role, as in the case of some cancers [13]. This has been complemented by the availability of high throughput screening methodologies and assays for many of these proteins [14,15]. The availability of these large-scale screening datasets in public domain also provides an immense opportunity to model the activities based on physicochemical and structural properties of molecules. Such approaches have been extensively used by our group previously towards modelling activities[16,17,18,19]. Such an approach would be immensely useful in drug discovery applications to significantly reduce the time and effort by prioritising molecules with desirable activities for in-depth screening and biological validation.

In the present study, we employed four datasets of high-throughput screens for inhibitors of epigenetic modifiers. We use machine learning approaches using chemical descriptors to create accurate predictive models for the activities. We also use an independent substructure based approach to identify commonly enriched substructures associated with the activities. To the best of our knowledge, this is the first report of computational modelling of biological activities of small molecule inhibitors of epigenetic modifiers.

Methodology

Dataset Source

The data for the potential inhibitors of Histone methyltransferases and demethylases was downloaded from PubChem repository of chemical compounds. The datasets were downloaded corresponding to AID 504332, AID 504339, AID 2147 and AID 540317.

Bioassay AID 504332: The qHTS was based on an assay developed for the inhibitors of G9a (Histone Lysine Methyltransferase) and included 30,875 active and 2,67,000 inactive compounds. G9a is a histone methyltransferase which belongs to SET-domain containing family and specifically catalyzes methylation of Lys9 of histone H3 (H3K9) in mammalian euchromatic regions repressing the transcription[20,21]. The knockdown of G9a results in transcriptional activation and inhibits cancer cells growth [22].

Bioassay AID 504339: The dataset contains inhibitors of JMJD2A-Tudor Domain JMJD2A, which is a jumonji-domain-containing histone demethylase (Lysine-specific demethylase 4A). JMJD2A binds to trimethylated H3K4 and H4K20 via the tudor domains and causes demethylation which may result in both, transcriptional repression and activation [23,24]. Binding of JmjD2A to histone results in positioning of the enzymes for methylating adjacent regions causing rapid methylation over large area of chromatin [25,26].Â Targeting of the JMJD2A-tudor domain interaction with the methylation marks on lysine residues of histone, H3 and H4, tails may lead to selective demethylation of a given methyllysine locus based on the methylation state of adjacent histone marks.The enzymatic activity of JMJD2 resides within a JmjC jumonji domain that employs a distinct catalytic mechanism, using Fe(II) and Î±-ketoglutarate as co-factors, and a radical attack mechanism. The precise complex association of the enzyme with co-factors determines which lysine is to be demethylated. Some of the enzymes are only capable of demethylating mono- and di-methyl substrates, whereas others can demethylate all three states of the methylated lysine. The data contain 16,919 active compounds and 3,38,945 inactive compounds.

Bioassay AID 2147: The dataset contains inhibitors of Human Jumonji Domain Containing 2E (JMJD2E). JMJD2E is a member of histone demethylases and belongs to the Fe(II) and 2-oxoglutarate oxygenase (2OG) superfamily. Histone lysine demethylases catalyze the removal of methyl groups from methylated lysine sidechains on histones H3 and H4, thus acting reversibly to the reactions catalyzed by histone lysine methyltransferases. The high throughput data contained a total of 3,523 active and 1,88,950 inactive compounds.

Bioassay AID 540317: The assay was developed to identify the first inhibitors of protein methyltransferases. The dataset contained 2,142 active and 3,67,962 inactive compounds screened for potential inhibitors of HP1-beta Chromodomain Interactions with Methylated Histone Tails HP1(Heterochromatin protein). The N- terminal chromodomain containing HP1 proteins bind to the methylated histones and further results in gene repression and heterochromatin formation. The interaction harbors an N- terminal chromodomain that binds to the tri-methylated lysine 9 of histone H3, H3K9me3, and a C-terminal chromoshadow domain.

Compounds in PubChem are characterized based on Activity Score calculated using AC50. AC50 is the concentration at which 50% of the activity is observed. Compounds having AC50 values less than or equal to 20 micromolar with corresponding activity score between 40-100 were considered as active compounds. Compounds having AC50 value greater than the highest concentration tested (for example 20 micromolar) and activity score 0 were considered as inactive compounds. The rest compounds with activity score between 1-39 were considered as inconclusive compounds.

All the datasets were obtained through the confirmatory bioassay screens conducted by NCGC, NIH Molecular Libraries Probe Production Network. The Amplified Luminescent Proximity Homogeneous Assay (AlphaScreen) from PerkinElmer was used for identification of G9a inhibitors. It is a homogeneous assay technology used for screening of different classes of targets and analytes. Donor and acceptor beads coated with a layer of hydrogel are utilized. The beads are conjugated with biological molecules. With excitation, ambient oxygen is converted to reactive singlet oxygen in the donor bead. The singlet oxygen species reacts with thioxene compounds in the acceptor bead to generate a chemiluminescent signal that emits at 370 nm. The method is well suited for detection of inhibitors acting by the desired histone peptide competitive mechanism. Methylation of biotinylated-histone peptide is measured through specific antibody-based detection, in conjunction with streptavidin-coated donor and anti-IgG antibody-coated acceptor beads.

Preprocessing of data

The files obtained from PubChem in the Structure Data Format (SDF) were used to generate 2D molecular descriptors using PowerMV [27], a popular toolkit for computing chemical descriptors. PowerMV generated a total of 179 molecular descriptors describing the physicochemical properties of the molecule (like hydrogen bond donors, acceptors, number of rotatable bonds, charge, polarizability, aromaticity etc.). The descriptor file generated was saved in comma separated (CSV) format. The attributes having bit string descriptor values of only one value throughout the dataset (all 0â€™s or all 1â€™s) were filtered out. The list of the descriptors generated and used is available as Supplementary file 1.

Processing of data: Model Building

The files saved in CSV format were then converted to ARFF ( Attribute Relation File Format) compatible with Weka. The models were built using different classifications viz. Naive Bayes and Random Forest as described previously [16, 17, 18, 19]. For each base model we have used cost sensitive classifier in order to reduce the false negative rate. This was imperative as the datasets were highly imbalanced with a large inactive set compared to the active set. The misclassification cost is incremented on false negatives until the false positive rate reduces to 20%. Assignment of the cost is random irrespective of the base classifier used. Cost sensitive classification is based on cost-proportionate weighting of the training examples, which can be realized either by applying weights to the classification algorithm , or by resampling of instances according to costs. We used Weka 3.6 (Waikato Environment for Knowledge Analysis) [28] which is a toolbox for data mining and machine learning algorithms. Weka provides a simple GUI supporting data from various sources and in different file formats. It has multiple algorithms (including that of regression, association rule mining, clustering, classification etc.) and pre-processing tools that allows comparison of different methods. The workbench is used for both supervised as well as unsupervised algorithms. The data visualization facilities helps in easy access and analysis of results. The dataset was then further randomly split into training and test sets. The training data comprising maximum part of complete data (80%) was used to train the model using different algorithms and the test set (20%) was to evaluate each of the models. All computation was performed on CDAC-Garuda supercomputing facility using the the OSDD-Garuda web interface.

Cross Validation

The models constructed were then tested on the test set which is encountered first time to the model and the objects are classified. Sometimes, in a randomly generated training set, one or more classes might be missed out consequently giving wrong prediction. In such cases the data is divided in such a way that each class is represented in both sets (stratified hold-out). Since inefficient for smaller datasets, k-fold cross validation is used where the entire data is divided into k subsets (folds) of equal sizes and training is done for (k-1) sets and testing is done on one set. The process is repeated k number of times so that each set is tested at least once. The average error rate is computed for all tests. We have used 5-fold cross validation here since the dataset was large.

Model performance evaluation

The 2X2 confusion matrix used by Weka contains the following values:

True positives (TP): class members classified as class members.

True negatives (TN): class non-members classified as non-members.

False positives (FP): class non-members classified as class members.

False negatives (FN): class members classified as class non-members.

We used the following measures for the statistical evaluation of the models:

Sensitivity is the proportion of actual positives which are predicted positive, i.e.TP / (TP + FN).

Specificity is the proportion of actual negatives which are predicted negatives,i.e. TN / (TN + FP).

ROC is receiver operating characteristic curve which is a 2D curve parametrized by one parameter of the classification algorithm, e.g. some threshold in the true positive rate /false positive rate.

The Matthews correlation coefficient (phi coefficient) is a measure of the quality of binary (two-class) classifications. The MCC is a correlation coefficient between the observed and predicted binary classifications; it returns a value between âˆ’1 and +1.

Balanced Classification Rate (BCR) introduces a balance in the classification calculated as 1â„2. (Sensitivity + Specificity).

Accuracy is the efficiency of the classifier to predict true values,i.e. TP+TN) /(TP+TN+FP+FN) * 100).

Substructure Search

Aligning structures on a key structural framework help chemists to interpret the influence of substituents on the template as initial point towards understanding structure-activity information. With the aim of finding molecules which have similar properties to act as a drug, similarity search of compounds was done. Library MCS, a tool from ChemAxon [29], based on hierarchical clustering algorithm was used to cluster the molecules and find the potential bioactive substructures. Hierarchical clustering only requires a measure of similarity between groups of data points. Maximal Common Substructure Search (MCS) is the process of finding the largest structure that is a substructure or part of all the molecules in a given set. Initial structures are found at the bottom of the hierarchy. The next level contains the maximum common structures of clusters of initial molecules, subsequent levels provide larger clusters of smaller common substructures.

Minimal MCS size refers to the smallest size of the maximum common substructure searched for by the algorithm. For different datasets different values were considered owing to the number of top level clusters found and the level count. For AID 504332 and 540317, minimal MCS size was taken 9 and for AID 504339 and 2147 it was 10 and 11 respectively. After the clusters were formed using LibMCS, we got the molecular scaffolds in the form of sdf and SMILES file. The active and inactive 3D structure file were then used to search the similar substructures with the smiles generated. This was done using the jcsearch algorithm of ChemAxon [30]. The similarity is calculated on the basis of theÂ molecular descriptor or fingerprintÂ of the chemical structures to compare. The Substructures were evaluated for enrichment using chi-square test. The p-values were used to evaluate the significance of enrichment. The substructures which had at least > 1% matches among the active dataset entries, p-value less than 0.01 and enrichment factor more than 2 were considered significant.

Results

The datasets obtained from PubChem were processed to generate 2D molecular descriptors using PowerMV. The descriptors were finally culled to 155 from 179 descriptors after removing useless values (Supplementary table 1). The complete data after splitting into train and test sets was loaded in Weka-3.6 to build different classifier models for the evaluation of compounds. Initially standard classification of the data was performed. But because of the skewness in data the FP rate was predicted very low. Thus, cost sensitive classification was introduced. The misclassification cost was applied on false negatives and incremented until the rate of false positives reached 20%. The costs applied in different datasets for different models are shown in Table 1. Naive bayes used minimum cost for the classification of the objects.

Evaluation of models included various statistical parameters. The accuracy of Random forest was predicted to be the highest for all the datasets. A comparison between the sensitivity of both the classification models amongst different datasets was made depicting the sensitivity of Random forest more than the Naive Bayes in all cases. Figure 2 shows the plot between sensitivity of Naive Bayes and Random Forest amongst AID 504332, 504339, 2147 and 540317. Similarly the specificity was compared where Naive Bayes outperforms in AID 504339 and 2147. In case of AID 540317 specificity of both was comparable and Random Forest showed higher specificity in AID 504332. Figure 3 is the comparative graph between the specificity of both classification models amongst all four datasets.

The sensitivity and specificity were used to calculate the balanced classification rate for each model. Random forest showed the most balanced classification out of both. As a measure of quality Matthews correlation coefficient (MCC) was calculated. This balanced measure could be used with classes of different sizes describing a correlation coefficient between the observed and predicted binary classifications. The statistic is also known as the phi coefficient. Table 2 shows the classification results of all the datasets along with the statistical evaluation.

A perfect test would have 100% sensitivity and 100% specificity. It would positively identify all the true cases of active drugs, and it would never mislabel anything. In realistic scenarios, we try to strike a balance between sensitivity and specificity. For that, a relation of sensitivity and specificity on a graph, called a "ROC curve". (ROC means Receiver-Operator-Characteristic) was plotted. Figure 4 summarises the ROC plot for Random Forest classification model for the four datasets.

Evaluation of significantly enriched scaffolds

In the process of drug discovery the local similarity between the structures proved to be useful in designing of new chemical compounds as potential drugs. We used JChem module, LibMCS and clustered the active compounds of all the datasets.

Clustering analysis of AID 504332

The 30875 active compounds clustered into a total of 5,150 clusters of which the 726 top level cluster compounds were considered. The compounds were clustered upto level 6 out of which 258 singletons were removed. The enrichment and its significance, was analyzed by chi-square test. Analysis revealed 19 significantly enriched scaffolds which had p-value less than 0.01 and an enrichment factor > 5.

Clustering analysis of AID 504339

A total of 16919 active compounds were clustered upto 6 levels at MCS size 10. 416 singletons were removed and 1026 compounds obtained at top level were taken for further analysis. We obtained 9 substructures prioritized by p-value (less than 0.01) and enrichment factor > 5.

Clustering analysis of AID 2147

The 3523 compounds were clustered keeping MCS size as 11, we obtained 3791 total clusters. A total of 702 compounds were obtained at level 5 out of which 365 singletons were removed. The final prioritization was done keeping p value less than 0.01 and enrichment factor >5, the analysis resulted in 9 sub structures.

Clustering analysis of AID 540317

The 2142 active compounds were clustered upto 6 levels keeping MCS size as 9. We obtained 216 compounds at top level after removing 93 singletons. Analysis revealed 8 significantly enriched scaffolds which had p-value less than 0.01 and an enrichment factor > 5.

Discussion

Understanding the function and regulation of epigenetic modifier proteins have been recently an actively pursued area of research [31]. This has been more so, with the increasingly understood mechanisms of epigenetic regulation in a number of disease pathophysiologies. Targeting these proteins as potential drug targets have been extensively discussed and pursued [32,33]. Testing large libraries of molecules for specific biological activities are usually time consuming and extremely costly. Computational methods for pre-selecting molecules from large libraries would offer a plausible time and cost-effective alternative [34]. It has been suggested that accurate methodologies to pre-select molecules for in-depth biological assays would accelerate the process of drug discovery. Machine learning approaches have been extensively now used for building predictive models for pre-selecting molecules form large molecular databases [16-19]. In the present report, we create accurate cheminformatics models based on chemical descriptors and artificial intelligence for specific biological activities against four well studied epigenetic modifiers. In the future, many such computational models could be integrated to provide for desirable set of properties or biological activities and has the potential to be the mainstay of drug discovery [35]. The study is not without caveats, the first being the paucity of data sets in public domain encompassing inhibitors for a large number of epigenetic modifiers precludes us from creating a comprehensive suite of predictive models, which could be eventually possible with more data sets being available in public domain. Nevertheless it has not escaped our attention that this is the first and most comprehensive report of cheminformatics models for inhibitors of epigenetic modifiers.

Acknowledgements

The authors thank Dr. Chetana Sachidanandan and Dr. Shantanu Sengupta for reviewing the manuscript. Authors also thank discussions with members of the Open Source Drug Discovery (OSDD) Programme. SJ acknowledges a Project Fellowship from CSIR-OSDD. Authors acknowledge the support from National Knowledge Network (NKN) and CDAC-Garuda for computing. Authors acknowledge technical help and support from CDAC-Garuda team, especially Ms Janaki Chintalapati , Ms Mangala N and Dr Subrata Chattopadhyay. This work was funded by the Council of Scientific and Industrial Research (CSIR), India through Grant BSC0122 (CARDIOMED)

Our Service Portfolio

Want To Place An Order Quickly?

Then shoot us a message on Whatsapp, WeChat or Gmail. We are available 24/7 to assist you.

Do not panic, you are at the right place

Visit Our essay writting help page to get all the details and guidence on availing our assiatance service.

Get 20% Discount, Now
£19 £14/ Per Page
14 days delivery time

Our writting assistance service is undoubtedly one of the most affordable writting assistance services and we have highly qualified professionls to help you with your work. So what are you waiting for, click below to order now.

Get An Instant Quote

ORDER TODAY!

Our experts are ready to assist you, call us to get a free quote or order now to get succeed in your academics writing.

Get a Free Quote Order Now

Computational Modeling Of Small Molecule Inhibitors Of Epigenetic

Background

Results

Conclusion

Dataset Source

Preprocessing of data

Processing of data: Model Building

Cross Validation

Model performance evaluation

Substructure Search

Evaluation of significantly enriched scaffolds

Clustering analysis of AID 504332

Clustering analysis of AID 504339

Clustering analysis of AID 2147

Clustering analysis of AID 540317

Our Service Portfolio

Want To Place An Order Quickly?

Do not panic, you are at the right place

Get 20% Discount, Now £19 £14/ Per Page14 days delivery time

Get An Instant Quote

Get 20% Discount, Now
£19 £14/ Per Page
14 days delivery time