Drug Discovery QSAR To Predictive Materials

Published Date: 02 Nov 2017

Ke Wu, Bharath Natarajan, Lisa Morkowchuk and Curt M. Breneman*

Department of Chemistry and Chemical Biology

Rensselaer Polytechnic Institute

110 8th St

Troy, NY 12180

Abstract

The Materials Genome Initiative (MGI) was conceived as a unified effort to capture, curate and exploit materials structure/property information on a grand scale to enable rapid, cost-effective development of novel materials with predictable properties. While the use of "genomic" methods to facilitate property prediction, virtual design and discovery of materials is relatively new, the concepts driving the development of Materials Informatics are based, solidly, on the lessons learned during the development history of Cheminformatics and Bioinformatics. This chapter describes some of the ways in which Cheminformatics and machine learning methods have been adapted for, and utilized in, materials science and engineering applications. Examples of how Materials Quantitative Structure-Property Relationship (MQSPR) models are created, validated and utilized are presented.

1 Historical Perspective

Research designed to identify relationships between the structures of molecules and their properties has been ongoing for nearly 150 years, forming the basis for modern chemistry and material science. Some of the earliest work in this area was published in 1868 by Brown and Fraser, who studied the connection between the chemical and physiological characteristics of compounds, laying the groundwork for modern structure-based drug design.1 It was fairly evident to the scientific observers of the time that the physical properties of a substance were inextricably related to its structure and composition. This notion has since provided a strong incentive for researchers to find ways to quantify the aforementioned relationship. Unfortunately, the details of the interactions responsible for any specific observable molecular or material property are often difficult or impossible to calculate explicitly or directly, even when using rigorous physics-based modeling methods. Although a priori computational techniques are gaining in importance with the availability of ever faster supercomputers, the development of heuristic methods, such as Quantitative Structure-Property Relationship (QSPR) modeling, provides an easier means to find multivariate (and often non-linear) functions that relate sets of calculable or easily observable quantities called descriptors to known molecular or material properties. This informatics-based approach captures the essence of the controlling physics through numerical descriptors, thereby enabling realistic property prediction.

Quantitative structure-activity/property relationship (QSAR/QSPR) models (for bioactivity in small drug-like molecules) have long been highly relevant in the area of drug discovery, but their inconsistent performance in real-world applications has resulted in their being periodically praised and condemned.2-4 One of the reasons for this checkered history of heuristic modeling is that these models can easily be misused. They are often constructed without proper validation, and are applied beyond their domains of applicability (with predictably poor results). Heuristic multivariate functions can be highly domain-specific and care must be taken to determine the scope of their applicability. An important assumption made in informatics is that the chemical space within the domain of applicability of the heuristic (or MQSPR) models is smooth, i.e. a small change in the structure leads to a small change in properties.5-7 In practice, this is only so when the chosen descriptors adequately represent the factors responsible for the observed properties. Sometimes an unexpected 'property cliff' within the model design chemical space can break the assumed smoothness.8 Models built on such 'rough' structure/property landscapes can only be useful if the smooth regions can be distinguished from the rough ones by adequately defining the domains of applicability.9

Additionally, modern QSAR/QSPR models are far more complicated than traditional multiple linear regression methods due to the availability of a multitude of descriptors and the power of non-linear machine learning methods. It is therefore advised that more caution be exercised when using them. The complexity and flexibility of a heuristic model, together with the potential for inappropriate choices of descriptors, can lead to over-fitting and misinterpretation by users.10 It is important to note that all heuristic QSAR/QSPR models function by identifying correlations between the variance of a number of structural descriptors and the properties of interest, without revealing actual causality. With this in mind, it is necessary for practitioners to validate any model they construct using thorough "best practices" cross-validation and y-scrambling methods to determine its statistical significance.11 These protocols, developed for QSAR/QSPR modeling, lend themselves to use by the MQSPR community in circumventing the ubiquitous problem of misuse. Figure 1 illustrates the three main components involved in producing good (M)QSPR models. Further details of model construction and validation methods are discussed in Section 3.

Figure 1
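The y-scrambling check mentioned above can be illustrated with a minimal sketch. It assumes a single descriptor, an ordinary least-squares line as the model, and synthetic data; the function names (fit_line, r_squared, y_scramble) are hypothetical and are not taken from the cited protocols. The idea is that a model trained on randomly permuted responses should fit far worse than the model trained on the true responses; if it does not, the original correlation is probably fortuitous.

```python
import random

def fit_line(x, y):
    """Ordinary least-squares fit y = a + b*x for a single descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(x, y):
    """Coefficient of determination of the fitted line on its training data."""
    a, b = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def y_scramble(x, y, n_trials=200, seed=0):
    """Return the true r^2 and the r^2 values obtained after shuffling y."""
    rng = random.Random(seed)
    scrambled = []
    for _ in range(n_trials):
        y_perm = y[:]            # copy so the original responses are untouched
        rng.shuffle(y_perm)      # break the descriptor/response pairing
        scrambled.append(r_squared(x, y_perm))
    return r_squared(x, y), scrambled

# Synthetic, purely illustrative data: response = 2*x + Gaussian noise
noise = random.Random(1)
x = [float(i) for i in range(20)]
y = [2.0 * xi + noise.gauss(0.0, 1.0) for xi in x]

true_r2, scrambled_r2 = y_scramble(x, y)
# Fraction of scrambled models that fit at least as well as the real one
frac_as_good = sum(r >= true_r2 for r in scrambled_r2) / len(scrambled_r2)
```

A value of frac_as_good near zero suggests that the fit is unlikely to be a chance correlation; it says nothing about whether the model will extrapolate outside its domain of applicability.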

As discussed earlier, the choice of descriptors is a very important aspect of QSAR/QSPR modeling. Before discussing descriptors appropriate to MQSPR modeling, a brief review of the general characteristics of molecular descriptors is presented. Early descriptors involved simple features, such as atom composition1 and group constitution12-15. Further development in descriptor technology led to the identification of spectroscopy-related, topological, geometrical, quantum chemical and local molecular surface property descriptors. This explosion of descriptors has resulted in the user having to choose between literally thousands of possible descriptors for creating any given heuristic model, but this chosen set of descriptors has to be of suitable size. A small subset of descriptors might be insufficient in adequately capturing the structural and molecular information, whereas a large one might contain many fortuitous correlations. This is because in machine learning methods, chemical behavior is considered to be controlled by only a few "latent variables16" which are not usually directly calculable, but are often represented as either linear or non-linear combinations of the numerical descriptors available to the model. Unless sufficient chemical information pertinent to the observed chemical or material property is contained in the available descriptors, the latent variables will not be properly represented, and the resulting model will not be expected to perform well. The inclusion of too many descriptors that are not important in latent variables can result in the over-fitting phenomenon described earlier. The science of "feature selection" has necessarily evolved to assist in determining which descriptors should be used to address a particular problem. Feature selection can be Objective or Subjective. The former refers to choosing appropriate descriptors based on their variance, and eliminating those that correlate too highly with other descriptors. 
The Objective label means that descriptors in the training dataset are selected or pruned prior to model construction, while the Subjective label refers to selecting descriptors on the basis of their effect on modeling performance on a validation set of molecules. Both methods have been used appropriately, but Objective feature selection often results in models that have larger domains of applicability and are less likely to be over-determined.
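Objective feature selection as described above can be sketched in a few lines. The thresholds (min_var, max_abs_r), the function names and the small descriptor table are all assumptions chosen for illustration; in practice the cut-offs are problem-dependent. Note that the response values are never consulted, which is what makes the selection "Objective".

```python
def variance(col):
    """Sample variance of one descriptor column."""
    m = sum(col) / len(col)
    return sum((v - m) ** 2 for v in col) / (len(col) - 1)

def pearson(a, b):
    """Pearson correlation coefficient between two descriptor columns."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def objective_selection(descriptors, min_var=1e-8, max_abs_r=0.95):
    """descriptors: dict mapping name -> list of values (one per training case).
    Returns the names kept after variance and pairwise-correlation pruning."""
    kept = []
    for name, col in descriptors.items():
        if variance(col) <= min_var:
            continue  # (near-)constant column: no variance to model
        if any(abs(pearson(col, descriptors[k])) > max_abs_r for k in kept):
            continue  # redundant with a descriptor that was already kept
        kept.append(name)
    return kept

# Hypothetical descriptor table for five training cases
table = {
    "n_carbon":   [2, 4, 6, 8, 10],
    "mol_weight": [30.1, 58.1, 86.2, 114.2, 142.3],  # nearly collinear with n_carbon
    "n_rings":    [0, 0, 0, 0, 0],                   # constant across the set
    "n_oxygen":   [1, 0, 2, 0, 1],
}
selected = objective_selection(table)
```

In this sketch the first descriptor of a correlated pair is the one retained, so the result depends on column order; other tie-breaking rules are equally defensible.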

In either case, a suitable set of chemical descriptors is selected by striking a compromise between simple descriptors like atom-types, which contain no connectivity information, and complicated descriptors such as those derived from IR frequency/intensity patterns. These descriptors contain reduced information about a structure and are often difficult to interpret. It should be recognized that no machine learning method can extract good correlation information from descriptors that are poorly chosen or badly designed for a specific modeling problem. The identification of "good" descriptors requires sufficient knowledge of the target problem and also knowledge of the physicochemical basis of the macroscopic properties to be modeled and predicted. The choice and design of descriptors for specific materials applications is discussed in more detail, in Section 2.

Exploratory QSPR work on materials properties began more than 20 years ago. Early work was focused on identifying correlations between the structure of polymer repeat units and the glass transition temperature (Tg) and mechanical properties of uncrosslinked systems.17-19 Additional studies were aimed at predicting the Tg of a wider variety of polymers20-38, as well as their mechanical and electrical properties.21, 27, 39-51 The QSPR methodology has since been applied to a wide variety of materials, including fullerenes and nanomaterials,52-81 composites and nanocomposites,82-84 catalysts,85-105 ceramics,106-113 and liquid crystals,114-116 among others. The major difference between these QSPR methods, appropriate for use on materials, and QSAR methods for drug-like molecules is the variety of length scales and structural features of the materials involved. Most materials applications involve macroscopic structures with large numbers of repeating units, which may or may not have crystalline order. Large systems such as polymers or composites can be difficult to represent using the same kinds of descriptors that are used for small drug-like molecules, which have well-defined conformational and electronic properties that can be treated more directly. General-purpose descriptors like electronic, topological or shape descriptors are often unusable for capturing the latent variables of many material properties. Specialized descriptor and model combinations are therefore required for QSPR on materials. In addition, descriptors related to the process parameters used in material fabrication are also necessary, since processing plays a significant role in determining bulk material behavior.112 As discussed earlier, the predictability of a given material property is determined by how much necessary information the available descriptors contain. 
Remarkably, quite simple descriptors can be rich in implicit information, and can be used to accurately represent a complex observed response. For example, recent publications show that good predictions of Tg of linear polymers (of mixed or variable tacticity) can be made using topological descriptors derived only from the structure of repeat units.34, 35 Often, more complex models combine material QSPR with other computational material methods, like Density Functional Theory (DFT), Molecular Modeling (MM) and Finite Element Methods (FEM) to jump across length scales. The integration of these physics-based models with MQSPR methods is discussed further in Section 4.

In the interest of completeness, it should be mentioned that not all materials modeling involves large systems. Several industrial applications employ small molecules as functional materials, for example, dyes for photovoltaic devices, brominated flame retardants (BFRs), ionic liquids, plasticizers and specialty solvents. Such systems have also been the subject of MQSPR investigations, and the techniques used to create models for these materials do not differ in principle from those used in traditional QSAR,84, 117-129 even when they necessitate the use of novel descriptors to address special properties. One of the several emergent applications of MQSPR for small molecules is predictive materials toxicology, a field that helps materials manufacturers prevent serious health and liability issues. Work in MQSPR-based predictive toxicology has been reported in the literature for BFRs,128, 129 nanoparticles72-80 and ceramic fibers.130

This chapter now delves deeper into each of the individual aspects of MQSPR modeling.

2 The Science of MQSPR: Choice and design of material property descriptors

The importance of appropriate descriptors for QSPR can never be overstated: they represent the basic language of chemistry. The primary function of a descriptor (or descriptor set) is to represent pertinent differences between structures/systems through its variance within a dataset. When appropriate descriptors are used, chemical (material) property space will appear smooth, and the applicability domains of the resulting models will be large. Descriptors can be derived from chemical constitutions, topological properties based on graph theory, conformational parameters, physical properties from other simple QSPR models (e.g. CLogP131), as well as electronic properties (such as EP) calculated either directly or indirectly from quantum calculations.

Descriptors can also be designed to represent complex structures in the form of "fingerprints".132, 133 For example, a fingerprint descriptor could be a bit string in which each bit represents the presence or absence of a particular atom or functional group at a given location in the structure. The advantage of this approach is that it is easier to interpret, but the disadvantage is that any case containing a sub-structure not included in the fingerprint library would be outside of the applicability domain of the resulting model. Atom-type descriptors are also a way of counting the number of atoms within specific bonding environments in each structure, which may be simpler than using sub-structure fingerprints because the number of atom types is far smaller than the number of possible bit string sub-structure combinations. Except for topological and electron-density based descriptors, there is a compromise between the cost of developing a large parameter library and the ability to represent complex interactions. Descriptors calculated using quantum calculations represent the local properties of the electron density, which are seen as more general and able to capture more synergetic effects, but at a much greater computational cost.134-136
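A bit-string fingerprint of the kind described above might look like the following minimal sketch. The six-entry substructure library is hypothetical, the positional ("at a given location") aspect is ignored for brevity, and the Tanimoto coefficient is used as one common way of comparing bit strings; it is an assumption here, not a measure named in the text. Raising an error for an unknown feature mirrors the applicability-domain limitation: a structure that the library cannot encode cannot be scored by a model built on that library.

```python
# Hypothetical substructure library; each entry owns one bit position.
LIBRARY = ["hydroxyl", "carbonyl", "amine", "phenyl", "halogen", "ether"]

def fingerprint(features):
    """Return a list of bits marking which library features are present.
    Features missing from the library cannot be encoded and raise an error."""
    unknown = set(features) - set(LIBRARY)
    if unknown:
        raise ValueError(f"outside fingerprint library: {sorted(unknown)}")
    return [1 if name in features else 0 for name in LIBRARY]

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two equal-length bit vectors."""
    both = sum(a & b for a, b in zip(fp_a, fp_b))
    either = sum(a | b for a, b in zip(fp_a, fp_b))
    return both / either if either else 1.0

# Feature sets are assigned by hand here; a real workflow would detect them
# from the structure with a cheminformatics toolkit.
phenol = fingerprint({"hydroxyl", "phenyl"})
anisole = fingerprint({"ether", "phenyl"})
similarity = tanimoto(phenol, anisole)  # one shared bit out of three set bits
```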

More details regarding commonly used descriptors in QSAR/QSPR can be found in Todeschini and Consonni's book.137 Some typical categories of descriptors that can be used in MQSPR applications are discussed in the following section.

2.1 Constitutional descriptors and group contributions

Non-parameterized constitutional descriptors refer to features that count the number and connectivity patterns between atoms or specific sub-structures within a chemical entity. They constitute the most straightforward type of MQSPR descriptors, and have the largest range of use for different types of materials.108-112, 138, 139 The term Quantitative Composition-Activity Relationships (QCAR) can be used to describe such models.139-141 The major advantage of constitutional descriptors is that the resulting models are highly interpretable because of their simplicity. If a QCAR model can successfully capture a material property of interest, this indicates that the contribution of each substructure is simply additive, especially for linear models. The connectivity condition is usually ignored. An important limitation of the QCAR method is that the introduction of any new constitutional substructures not contained in the existing training data is automatically outside of the domain of applicability of the model.

Parameterized constitutional descriptors refer to features that attribute certain sub-structures with corresponding parameters. Examples are molecular weight, sum of atomic Van der Waals volumes, number of halogen atoms and the Moriguchi octanol-water partition coefficient.142, 143 These descriptors are designed to consider the separate contributions of each substructure and are usually associated with a very clear physical or chemical meaning. Parameterized constitutional descriptors are more generally applicable than non-parameterized ones, since each substructure contributes an increment of a specific property to the material model, rather than being a simple accounting of its presence.
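The distinction between the two kinds of constitutional descriptor can be made concrete with a small sketch. Element counts are non-parameterized (a simple accounting of what is present), whereas molecular weight is parameterized (each atom contributes a stored increment). The function name and the dictionary-based input format are assumptions for illustration; the atomic weights are rounded standard values.

```python
# Rounded standard atomic weights in g/mol (the stored "parameters")
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
                 "F": 18.998, "Cl": 35.45, "Br": 79.904, "I": 126.904}
HALOGENS = {"F", "Cl", "Br", "I"}

def constitutional_descriptors(composition):
    """composition: dict element -> atom count, e.g. {"C": 2, "H": 3, "Cl": 1}."""
    # Non-parameterized: plain counts of each element
    counts = {f"n_{el}": n for el, n in composition.items()}
    # Parameterized: every atom adds an increment taken from a parameter table
    mol_weight = sum(ATOMIC_WEIGHT[el] * n for el, n in composition.items())
    n_halogen = sum(n for el, n in composition.items() if el in HALOGENS)
    return {**counts, "mol_weight": mol_weight, "n_halogen": n_halogen}

# Repeat unit of poly(vinyl chloride), C2H3Cl
pvc = constitutional_descriptors({"C": 2, "H": 3, "Cl": 1})
```

An element absent from the parameter table raises a KeyError, which is the parameterized analogue of falling outside the model's domain of applicability.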

Group contributions involved in linear free energy relationships (LFER) can be regarded as a specific type of QSPR modeling that utilizes parameterized constitutional descriptors. This method was first popularized for small molecules by Hammett15 and then further studied and modified by others. Hammett described the effect of a substituent in the meta or para position of the benzene ring upon the rate or equilibrium of a reaction using a function proportional to energy:

\[ \log \frac{K}{K_0} = \sigma \rho \]

where K is a rate constant or an equilibrium constant for a substituted reactant, K0 is the corresponding quantity for the un-substituted reactant, σ is a substituent constant which is determined by the type and position of an R group, and ρ is a reaction constant which is dependent upon the nature of the reaction, the medium and the temperature.15 Further work included Taft's inclusion of steric effects144 and Hansch's work on the adaptation of such formulas to drug design.145 Group contribution methods rely on obtaining the positionally-dependent parameters for each substituent group using other models or experimental results. This means that each parameter must be obtained beforehand and then stored in a library. Interestingly, these derived R-group parameters can be used not only in the target equilibrium or rate problems, but also as more general descriptors for modeling other properties not directly related to equilibrium shifts or relative reaction rates.146
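As a worked illustration, the Hammett relationship log10(K/K0) = σρ can be evaluated directly once σ and ρ are taken from a parameter library. The sketch below is an assumption-laden example rather than a reproduction of any cited calculation: σ = 0.78 is the commonly tabulated para-nitro value, ρ = 1.0 corresponds to the reference reaction (ionization of benzoic acids in water at 25 °C), and K0 is taken from a benzoic acid pKa of about 4.20.

```python
import math

def hammett_log_ratio(sigma, rho):
    """log10(K/K0) given a substituent constant and a reaction constant."""
    return sigma * rho

def hammett_constant(k0, sigma, rho):
    """Rate or equilibrium constant K of the substituted reactant."""
    return k0 * 10 ** hammett_log_ratio(sigma, rho)

# Illustrative inputs (assumptions, see the paragraph above)
k0 = 10 ** -4.20                       # Ka of the un-substituted acid
k_nitro = hammett_constant(k0, sigma=0.78, rho=1.0)
pka_nitro = -math.log10(k_nitro)       # predicted pKa of the substituted acid
```

Because σ is positive for an electron-withdrawing group and ρ is positive for this reaction, the predicted acid is stronger (lower pKa) than the parent compound.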

Binary descriptors are the simplest form of fingerprints. They are usually vectors of 1s and 0s, where each bit represents whether a particular feature is present or not. They are mostly used as search keys or to determine the similarity between structures, but they can also be used within QSPR regression models.147 Common types of fingerprints include Daylight Fingerprints and MACCS keys. When used as constitutional descriptors, integer counts of specific features are used. An effective use of an MQSPR model constructed using constitutional descriptors is shown in Figure 2:

Figure 2

In this example, Lusvardi and coworkers performed a QSPR study on the density of silica-based bioglasses.106 When the descriptor representing the number of X-O-X bridge structures was normalized by the total number of oxygen atoms, the resulting X-O-X bridge fraction was found to correlate with the bioglass density with r2 = 0.9798. This "zeroth-order" MQSPR relationship was found to be useful and interpretable: the pertinent descriptor was thought to represent the degree of polymerization of the glass network, which might be expected to be related to its density.106

2.2 2-D descriptors

The term "2-D descriptors" here refers to those features calculated using only the 2-D representation of molecule structures by considering only atom type and connectivity information. By definition there is some overlap in the information contained between 2D and constitutional descriptors.148 A major part of 2-D descriptors are "topological descriptors", which come from applying graph theory to the 2-D connectivity representation of a molecule. A typical treatment is to consider each atom as a vertex and each bond as an edge, as illustrated in Figure 3.

Figure 3.

Kier et al. introduced the connectivity descriptor type χ in the 1970s to capture this information.149 The earliest version of χ simply considered the connectivity of each heavy atom. Later modifications took other atom types into consideration.150 A commonly used topological descriptor function can be written as:

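\[ {}^{m}\chi^{v} = \sum_{s} \prod_{k \in s} \left( \delta_k^{v} \right)^{-1/2}, \qquad \delta_k^{v} = \frac{Z_k^{v} - h_k}{Z_k - Z_k^{v} - 1} \]

where $Z_k$ is the atomic number of the k-th atom; $Z_k^{v}$ is the number of valence electrons in the k-th atom; $h_k$ is the number of hydrogen atoms attached to the k-th atom; and $m$ is used to denote the complexity of the descriptor, with the sum running over all subgraphs $s$ that contain $m$ bonds. If $m$ is 0, the sum of $(\delta_k^{v})^{-1/2}$ for each atom is calculated, but if $m$ is 1 the sum of $(\delta_i^{v}\delta_k^{v})^{-1/2}$ for bonded atoms i and k is calculated. This descriptor considers the inner electrons for the atoms in the third or higher period and treats multiple bonds as multiple edges.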

Other commonly used descriptors include the Wiener index,151 the Balaban J index,152 the Kappa shape index,153 and the E-state index.154 The topological descriptors are usually calculated based on the adjacency matrix and distance matrix, but other matrices are also utilized for special purposes.
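Two of the simplest topological descriptors can be computed directly from the hydrogen-suppressed molecular graph, as the following sketch shows. It is a simplified illustration under stated assumptions: the first-order connectivity index uses plain vertex degrees (the number of bonded heavy atoms) rather than valence-corrected values, so heteroatoms and multiple bonds are not distinguished, and the function names are hypothetical. The Wiener index is obtained from shortest-path distances found by breadth-first search.

```python
from collections import deque

def adjacency(n_atoms, bonds):
    """Adjacency list of a hydrogen-suppressed molecular graph."""
    adj = {i: [] for i in range(n_atoms)}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    return adj

def wiener_index(adj):
    """Sum of shortest-path (bond-count) distances over all atom pairs."""
    total = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from `start`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2                     # every pair was counted from both ends

def chi1(adj):
    """First-order connectivity index built from simple vertex degrees."""
    return sum((len(adj[i]) * len(adj[j])) ** -0.5
               for i in adj for j in adj[i] if i < j)

# Carbon skeletons of the two C4H10 isomers
n_butane = adjacency(4, [(0, 1), (1, 2), (2, 3)])
isobutane = adjacency(4, [(0, 1), (0, 2), (0, 3)])
```

Both indices are lower for the branched isomer (Wiener index 9 versus 10), which is the sense in which such descriptors encode branching and basic shape.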

Topological descriptors have been widely used for both drug discovery and material QSPR modeling since they are designed to account for basic shape, branching and other topological features of molecules of any given size. In contrast to constitutional descriptors, which suffer from a lack of connectivity information, topological descriptors incorporate more information about atom types, bond types and other local properties like atomic electronegativity.155

Infinite-chain values of some topological descriptors for polymers are reported.156, 157 However, these are not extensively available since most MQSPR studies on polymers use descriptor characterizations of repeat units or some modification of that approach. A good review on the use of topological descriptors in drug design QSPR modeling was published by Roy in 2004.158

The effective use of topological descriptors in MQSPR models for polymer Tg,21, 23, 37 surface energy,159 viscosity,50 polarizability38 and dielectric constant160 has been reported.

2.3 3-D descriptors

Following the trend of nomenclature used for 2-D descriptors, 3-D descriptors are properties calculated from 3-D representations of structures. Except for some studies that describe the use of distance matrices to characterize 3-D structures,161 most 3-D descriptors belong to one of three categories: shape-related, energy-related and surface-related.

Shape-related descriptors are an extension of constitutional descriptors. Many studies using shape or shape/property hybrid descriptors have been described for ligand-based drug discovery because of the importance of shape compatibility and property complementarity between ligands and target enzyme binding sites.162-166 Some representative examples of the use of this descriptor class to create quantitative models of intermolecular interactions may be found in the recent literature.36, 167

Energy-related descriptors can be calculated through utilization of common semi-empirical methods such as AM1, MNDO and PM3, or through DFT or ab initio methods. Depending upon the level of theory used, these properties can include HOMO and LUMO energies, electrostatic potential values, dipole moments and other related properties. In MQSPR applications, energy-related descriptors are most often used to account for electronic properties, such as the polarizability or refractive indices of polymers.168 Additionally, polymer Tg values have also been modeled using energy-related descriptors.20

Surface property descriptors can be calculated by either empirical methods or semi-empirical quantum mechanical (QM) methods. In contrast to other electronic property descriptors, surface property descriptors represent the distribution of specific electronic properties or values of empirical functions evaluated on the molecular surface. For this purpose, "molecular surfaces" may be defined in a variety of ways: as traditional van der Waals surfaces, electron density isosurfaces, or solvent-accessible or Connolly surfaces.169 Simple examples of surface property descriptors are polar surface area (PSA) and solvent accessible surface area (SASA). More complex examples include distributions of electrostatic potential (EP) or Active Lone-Pair (ALP) surface values, as shown in Figure 4.

Figure 4.

Combinations of the shape-related and surface-related descriptors can also be designed. Das et al. created the Property Encoded Shape Distribution (PESD) descriptors to classify protein binding sites using both shape and surface property information.170

3-D descriptors are conformationally-sensitive because they are derived from the coordinates of the molecular structures involved. This can be both an asset and a liability, depending upon how well the conformational preferences of the molecule are known. In the case of non-crystalline polymers or amorphous materials, constructing models using traditional 3D descriptors may prove impractical or nonpredictive.

2.5 AIM derived descriptors

Descriptors representing properties of the electron density distribution of a molecule or material can be very useful in (M)QSPR modeling.171 Even in the case of drug-like small molecules, the significant CPU requirements of QM calculations largely limit their use. On the other hand, electron density-derived descriptors based on the theory of atoms in molecules (AIM)172, 173 provide a feasible and fast way to calculate descriptors based on these properties.

Using AIM theory, the electron density around each atomic nucleus in a molecule can be partitioned into unique electronic "basins" that individually satisfy the virial theorem, creating valid quantum subsystems for each atom. The gradient of the electron density around each atom terminates either at the nucleus or at the critical point on the bond path (bond critical points, or "BCP"). The electron density has a minimum at BCP along the bond direction, and a maximum along the direction perpendicular to the bond. Based on the BCPs, the molecule can be partitioned into its constituent atoms for individual analysis, or can be aggregated to provide a variety of electron density property descriptors. Two examples are discussed below.

The BCP space concept was first suggested by Popelier174 who then used it to perform similarity comparisons between pairs of molecules. In this approach, each bond is represented by three characteristic properties of its BCP: the electron density, the Laplacian of electron density and the bond ellipticity. Similarity between two molecules with the same structural scaffolds can be measured by calculating the Euclidean distance between all the BCPs in the space spanned by those three properties. Since this approach results in a distance matrix representation of the similarities between molecules, it represents a form of "kernel" descriptor that can be used for creating either regression or classification models of observed properties.
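The BCP-space comparison can be sketched as follows. This is a simplified reading of the approach and not Popelier's published formula: it assumes that the two molecules share a scaffold, that their bonds are listed in the same order, and that the dissimilarity is the sum of Euclidean distances between corresponding BCPs. The numerical BCP properties are placeholders, not computed values, and any scaling of the three properties to comparable ranges is omitted.

```python
import math

def bcp_distance(bcps_a, bcps_b):
    """Sum of Euclidean distances between corresponding BCPs of two molecules.
    Each BCP is a tuple (density, laplacian, ellipticity); both lists must
    describe the same scaffold bonds in the same order."""
    if len(bcps_a) != len(bcps_b):
        raise ValueError("molecules must share the same set of scaffold bonds")
    return sum(math.dist(p, q) for p, q in zip(bcps_a, bcps_b))

def distance_matrix(molecules):
    """Pairwise BCP-space distances; molecules is a dict name -> BCP list."""
    names = list(molecules)
    return [[bcp_distance(molecules[a], molecules[b]) for b in names]
            for a in names]

# Placeholder BCP properties for two hypothetical molecules with two bonds each
mol_a = [(0.30, -0.80, 0.00), (0.25, -0.60, 0.10)]
mol_b = [(0.30, -0.80, 0.00), (0.28, -0.64, 0.10)]
D = distance_matrix({"a": mol_a, "b": mol_b})
```

The resulting symmetric matrix D is the kind of pairwise-distance representation that kernel-based regression or classification methods can consume.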

An earlier example of descriptors based on AIM theory is Breneman et al.'s Transferable Atom Equivalent (TAE/RECON) descriptors.175, 176 Instead of using the properties based on the BCP to compare structurally similar molecules, the TAE/RECON method reconstructs new molecular scaffolds using a pre-computed fragment library. The fragments are derived from a large number of small molecule examples of a wide variety of bonding environments calculated by ab initio methods and partitioned by BCPs. The TAE library was constructed by defining a set of common fragments organized according to the TAE types of adjacent atoms. This strategy enables appropriate TAE fragments to be quickly identified by topology, and smoothed into a 3D molecular structure to provide an accurate approximation of the molecular electron density distribution. TAE/RECON descriptors are then calculated from the distributions of various electronic properties on the surfaces of the reconstructed molecules, and from the electronic field properties. The calculation times for RECON descriptors are very short (< 1 sec per structure on modern workstations), and scale linearly with both the number of atoms and the size of the library. By using distance-based autocorrelation descriptors based on TAE properties, both 2D and 3D information can be captured using this technique.167

RECON/TAE descriptors can be viewed as an advanced version of parameterized constitutional descriptors. Due to the fragment reconstruction approach of RECON/TAEs, initial conformational information for training molecules is not necessary; instead, reasonable 3D geometries are created during descriptor generation based on the assumption that atoms with similar bonding environments should have similar electron density properties.

2.6 Vibrational Spectral descriptors for Materials Applications

Apart from descriptors derived through atomistic or group contribution approaches, those from spectrometry/spectroscopy are enriched in important structural information, and can produce models with good physical interpretability. The input spectra can be obtained either from experiment177-179 or from ab initio QM calculations.180, 181 In contrast with standard structure-derived descriptors, those obtained from experiment can hardly be used for virtual design of new materials, but can provide valuable insights into the relationships of easily observed physical observables with others that are more difficult to determine. These and other types of descriptors have been devised and utilized within both biological and materials domains, and represent only a few of the ways that chemical, biological or material properties can be captured at different length scales.

3 Mathematical methods for QSPR/QSAR/MQSPR

3.1 Methods and Machine Learning Workflow

A variety of machine learning methods can be used to find quantitative relationships between sets of descriptors and target properties (responses). Which strategy is appropriate depends on the type of outcome desired: classification, rank ordering, or regression. Classification models are designed to assign a substance to one of a given number of categories such as "active" and "inactive", or "above 1" and "below 1"; they can be used to separate groups of molecules according to the presence or absence of a target property. Ranking models output the order of molecules for a specific property; they can be used for systems where the exact value of the response is not important, but establishing the priority of one case over another is more significant for problem solving. Regression models seek to determine a function that can represent a continuous hypersurface that relates descriptor variance to observable responses in chemical space. Regression models are used for problems where a real-number value of a particular property is needed (such as melting point or pKa). High quality experimental training data is necessary in addition to appropriate sets of descriptors to make predictive regression models. It may be the case that a particular combination of dataset and descriptor set can be modeled effectively using any of the three methods, but the quality (error bars and outliers) of the experimental response data and the availability of appropriate descriptors may favor one approach over another. For example, when the relationship between a set of descriptors and a given response is weak or highly non-linear, it may be possible to create useful classification or rank-order models, but attempts at regression training may yield highly overfit and nonpredictive models.

Machine learning methods can be either supervised or unsupervised. Supervised learning uses experimental responses ("labels") and descriptors ("features") to train the model, while unsupervised learning focuses on the structure of the data as represented by the descriptors alone. Examples of unsupervised learning are clustering (through hierarchical trees, Kohonen Neural Network maps, as well as other similarity association methods)182, 183 and singular value decomposition (SVD)184. In this section, we focus on supervised methods, where the responses are known for the training set, and the output of the model is a predicted value of molecular or material property.

Given the need to extract chemical property information from a number of potentially non-orthogonal descriptors (a task unsuitable for multiple linear regression), Principal Component Analysis (PCA)185 is often used in one of several forms. In its simplest incarnation, PCA can be thought of as a linear transformation that creates a set of orthogonal coordinates from a "basis set" comprised of the original descriptors. In doing so, the data in original descriptor space is rotated into PCA space, where each coordinate is orthogonal to all others. There will be as many Principal Components as there were original descriptors, but convention dictates that the number of PCs used for modeling can be truncated at three or four dimensions for most properties of interest without sacrificing model quality. Figure 5 illustrates a 2D PCA transformation, while Figure 6 shows the first three PCs of a higher dimensional transformation. While the resulting PC-based coordinate system is fully orthogonal and suitable for building linear models, the fact that each PC represents a linear combination of descriptors can make model interpretation problematic in terms of the original descriptors.

Figure 5

Figure 6

An interesting and useful feature of 3D PCA modeling is that it incorporates elements of both supervised and unsupervised learning. Examination of Figure 6 shows that transformation of this dataset resulted in some clustering of cases in the first three dimensions of PCA space, which can reveal inherent similarities between them. Additionally, the coordinates of each case within PCA space can themselves be used as descriptors for producing regression models. One important extension of PCA is Partial Least-Squares (PLS) Regression.186 PLS is a supervised technique in which both the descriptor matrix and the label vector are projected into a new space, so that the resulting components are aligned in directions that explain the greatest variance in the response. This results in models that are tolerant of non-orthogonal descriptors and resistant to overtraining.
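To make the PLS idea more concrete, the following is a hedged sketch of single-response PLS fitted with the classical NIPALS iteration. It is not taken from the work cited here: the function names pls1_fit and pls1_predict, the synthetic collinear descriptors and the choice of two latent variables are all assumptions for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with the NIPALS iteration."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean            # residual descriptor and response blocks
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f                          # direction of maximal covariance with the response
        w /= np.linalg.norm(w)
        t = E @ w                            # latent-variable scores
        tt = t @ t
        p = E.T @ t / tt                     # descriptor loadings
        c = f @ t / tt                       # response loading
        E = E - np.outer(t, p)               # deflate both blocks
        f = f - c * t
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # coefficients in original descriptor space
    return coef, x_mean, y_mean

def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (np.asarray(X, dtype=float) - x_mean) @ coef + y_mean

# Synthetic example: five strongly collinear descriptors driven by two latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(40, 2))
mixing = np.array([[1.0, 0.8, 0.0, 0.5, -0.3],
                   [0.0, 0.4, 1.0, -0.5, 0.9]])
X = latent @ mixing + 0.05 * rng.normal(size=(40, 5))
y = latent @ np.array([1.0, -0.5]) + 0.05 * rng.normal(size=40)

model = pls1_fit(X, y, n_components=2)
y_hat = pls1_predict(model, X)
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print("training r2 with 2 latent variables:", r2)
```

Because only a few latent variables are retained, the fit does not require inverting the nearly singular descriptor covariance matrix, which is one way to see why PLS tolerates non-orthogonal descriptors.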

The history of heuristic model building has shown that models constructed in this manner can be quite useful and predictive, but can suffer from fortuitous correlations within the data used to train them. To minimize this "sampling risk", the training set can be subdivided into random groups which are then used to create models to predict the values of the cases left out of those groups. This approach is called "bootstrap modeling", and is a robust form of model cross-validation. Figure 7 illustrates an example of the three-round/three-fold cross-validation of a regression model. In each of the three rounds shown, the original training data is randomly partitioned into three parts, which are separately used to produce regression models that predict the response values of the cases in the remaining folds. This results in a "bootstrap aggregate" model (actually a set of independent models) whose results are combined to produce the final predictions, as well as an indication of the variance of each prediction. In practice, it is common to use ten or more rounds of ten-fold cross-validation to create a robust regression model.11 Note that this approach can be used to improve the modeling performance of any linear or nonlinear regression method, such as PLS, Support Vector Machine Regression (SVR)187, 188 or Artificial Neural Networks (ANNs)189. The distribution of predictions also provides an estimate of the quality of the aggregated models for making quantitative predictions.
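The repeated partitioning can be sketched in a few lines of Python. This is an illustrative assumption rather than the authors' implementation: it uses the conventional arrangement in which each model is trained on all folds but one and predicts the held-out fold, ordinary least squares stands in for the regression method, and the data are synthetic.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept (stand-in for any regressor)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return coef[0] + X @ coef[1:]

def repeated_kfold_predictions(X, y, n_rounds=3, n_folds=3, seed=0):
    """Return the mean and spread of held-out predictions over all rounds."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty((n_rounds, n))
    for r in range(n_rounds):
        folds = np.array_split(rng.permutation(n), n_folds)   # new random partition each round
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            coef = fit_ols(X[train], y[train])
            preds[r, test] = predict_ols(coef, X[test])       # each case predicted once per round
    return preds.mean(axis=0), preds.std(axis=0)

# Synthetic data set: 60 cases, 3 descriptors, small measurement noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=60)

mean_pred, spread = repeated_kfold_predictions(X, y)
cv_r2 = 1.0 - np.sum((y - mean_pred) ** 2) / np.sum((y - y.mean()) ** 2)
print("cross-validated r2:", cv_r2)
```

The per-case spread across rounds is the quantity that gives an indication of the variance of each prediction; moving to ten rounds of ten-fold partitioning only requires changing n_rounds and n_folds.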

Figure 7.

An additional criterion of model quality has also become part of the "best practices" of (M)QSPR modeling. This step is used to estimate the susceptibility of any combination of dataset, descriptors and learning method to overfitting. The technique is called "Y-scrambling", and consists of multiple attempts to produce models from the real dataset for which the responses (labels) are randomly scrambled between cases. In each attempt, the full modeling protocol described above is applied, and the quality of the scrambled models (often r2 values or RMS errors of the training models) is assessed relative to the models produced using "real" (unscrambled) data. The concept is that a valid heuristic model can only be produced from a training set if the techniques used do not allow scrambled models to be created that fit the scrambled data as well as the real models fit the real data. Figure 8 illustrates these two situations.

Figure 8.

In each of the figures shown above, the Y axis represents training model r2 values, while the X axis represents the correlation coefficient between each of the scrambled response vectors and the real response vector. As shown in the left-hand figure, each of the scrambled models exhibits a low r2 value that is easily distinguished from the high r2 value shown for the real model. All of the scrambled models were created using response vectors with low correlations with the real response vector, resulting in their being on the far left of the X axis in that figure. The modeling method used here is therefore more trustworthy than the one shown in the right-hand figure. In that case, the scrambled models have nearly the same apparent training performance as the real one, indicating a real risk of producing a spurious model that is made up of only fortuitous correlations, and therefore would not be predictive.
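A Y-scrambling test of this kind can be sketched as follows. The sketch is a simplification under stated assumptions: ordinary least squares replaces the full modeling protocol, training r2 is used as the quality measure, and the data, the function names and the choice of 50 scrambling trials are hypothetical.

```python
import numpy as np

def training_r2(X, y):
    """Training-set r2 of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def y_scramble(X, y, n_trials=50, seed=0):
    """Compare the real model with models trained on permuted responses."""
    rng = np.random.default_rng(seed)
    real_r2 = training_r2(X, y)
    results = []
    for _ in range(n_trials):
        y_perm = rng.permutation(y)                 # shuffle labels between cases
        corr = np.corrcoef(y, y_perm)[0, 1]         # X coordinate of the scrambling plot
        results.append((corr, training_r2(X, y_perm)))  # Y coordinate of the plot
    return real_r2, np.array(results)

# Synthetic data set with a genuine descriptor/response relationship.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.5, -1.0, 0.7]) + 0.1 * rng.normal(size=40)

real_r2, scrambled = y_scramble(X, y)
print("real r2:", real_r2)
print("best scrambled r2:", scrambled[:, 1].max())
```

If the best scrambled r2 approaches the real r2, the combination of dataset, descriptors and method is able to fit noise, and the real model should not be trusted.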

Figure 9 illustrates the workflow required to create a modern, validated QSAR/QSPR model for application to true unknowns. Before any feature selection or model parameter derivation, a representative sample of the original dataset is put aside as an external test set while the remaining labeled data is used to train and validate the model as described above. The external test set, if the responses are known, can be used to test the likely 'out of sample error' of that particular combination of descriptors, dataset and modeling method, or the model can be applied to true unknowns.

Figure 9

If the chemical space is properly sampled, the quality of predictions made using this workflow will be similar to later "blind" predictions made on true unknowns, provided that they are within the domain of applicability of the model.
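The first step of this workflow, setting aside the external test set before any feature selection, can be sketched with the Python standard library alone. A purely random split and a 20% test fraction are assumptions made here; the text calls only for a representative sample, and the function name is hypothetical.

```python
import random

def split_external_test(case_ids, test_fraction=0.2, seed=0):
    """Randomly set aside an external test set before any modeling is done."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)                 # reproducible shuffle
    n_test = max(1, round(test_fraction * len(ids)))
    return ids[n_test:], ids[:n_test]                # (training pool, external test set)

train_ids, test_ids = split_external_test(range(30))
print(len(train_ids), "cases for training and validation;", len(test_ids), "held out")
```

Only the training pool is then passed to descriptor selection, cross-validation and Y-scrambling, so that the held-out cases give an unbiased estimate of out-of-sample error.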

3.1 Experimental data

For materials QSPR, the quality of experimental data (the responses) requires particular attention. Just as the descriptors need to include specific physically relevant information about the phenomena being modeled, experimental data culled from multiple sources must be treated with extreme caution in order to avoid systematic errors and the introduction of unknown factors. Systematic errors often emerge when different methods are used for measuring a property, or even when the work is performed in different laboratories. Unknown factors, in the context of Materials Informatics, can be sample-dependent, or can be introduced during the manufacturing process. These can include impurities and defects. This complication often plagued early QSAR/QSPR modeling efforts in the pharmaceutical industry due to the variance in assay methods and biological responses. In the field of Materials Informatics and MQSPR, more accurate and consistent responses are possible due to the standardization of measurement methods, but they can still confound modeling efforts. To avoid such problems, sufficient knowledge of the consistency of the manufacturing process and the ability to detect, evaluate and control impurities and defects are required. Another challenge for Materials QSPR is the current scarcity of available data. Large datasets are needed to produce models of broad applicability. Fortunately, the Materials Genome Initiative provides a mechanism to address this problem. In practice, small datasets serve to limit the number of descriptors that can be used in a model while still avoiding overfitting, but limiting the number of descriptors can also reduce the available chemical information present in a model. 
In our experience, a lower limit of 20 data points is required to train a robust model, and ideally several hundred points should be used.190 Additionally, the ratio of the number of descriptors to the size of the training set is recommended to be less than 1:5, but with aggressive cross-validation and appropriate selections of descriptors and methods, the ratio of data to descriptors can be reduced.191 Finally, for unbalanced datasets, where the number of cases in different regions of chemical property space differs significantly, special treatment must be applied so as not to bias the models.190, 192

An extensive review on the current mathematical methods for use in creating valid QSAR/QSPR models has recently been published.193

4 Integration of Physical and MQSPR Models for Nanocomposite Materials Modeling

Having discussed the details of descriptor-based model construction, we now turn to the integration of MQSPR modeling with physics-based continuum modeling. The effectiveness of such a combination in linking phenomena at different length scales is demonstrated using polymer nanocomposites as a model system. Polymer nanocomposites (PNCs) are complex material systems in which the dominant length scales, i.e. the particle sizes, radii of gyration of the polymers and the interparticle spacings, begin to converge.194 The energetic interactions between the constituent species at this length scale (nanometers) dictate the mesoscale nanoparticle dispersion morphology and interphase polymer properties. These mesoscopic characteristics are further expressed as deviations in the continuum properties of the neat polymer.195 The PNC literature has shown these materials to be useful in a multitude of applications, such as smart lighting, high voltage transmission, germ resistance and flame retardation.196-199 However, it is recognized that PNCs are yet to realize their market potential due to the inability to design or predict their macroscopic properties using microscopic constituent information.200 This shortcoming is attributed to the lack of understanding of the interactions at various length scales and the physics bridging them.201 A thorough understanding of the underlying physics to enable ab initio prediction of properties would require rigorous multi-scale modeling and the development of novel scale bridging techniques, both of which are extremely time and resource intensive. A novel interdisciplinary approach combining heuristic MQSPR, physics-based continuum modeling and experimental validation is demonstrated to predict the thermomechanical properties of polymers embedded with silane-modified spherical nanoparticles.

The guiding rationale in this study was that the dispersion morphology and interface polymer properties can be predicted based on the surface energetic components (dispersive, polar) of the particle and matrix polymer, using correlations built from experiments.202, 203 These predicted dispersions and interface mobilities can be employed to build 3D continuum FEM models that further predict the bulk viscoelastic properties. Thus, if the surface energies themselves can be predicted from properties of the starting functionalized particle and matrix chemistries using MQSPR models, the thermomechanical properties of the resulting nanocomposites could be predicted virtually. The paradigm used for this approach is shown in Figure 10.

Figure 10.

Going up in length scale, the first vital step therefore is to create MQSPR models capable of calculating the surface energetic components of both the matrix and the functionalized nanoparticle.204, 205 To train the required models, literature data of experimentally-determined dispersive and polar components for 30 different polymers206-216 was collected and curated to ensure that all of it was obtained by the same characterization technique (contact angle and geometric mean assumption217). "Best practices" methods were then employed as prescribed earlier to train and validate MQSPR models based on the literature data, utilizing both PLS and SVM regression methods. In order to minimize chance correlations, 100 "bootstrap" models were built to predict values for all the polymers and nanoparticles in the sequestered validation set. The procedure for accomplishing this was performed as described in Section 3. A 20-mer representation of the polymer molecules was found to be sufficiently representative of molecular properties, with conformationally-sensitive energetic parameters converging with increasing degree of polymerization. The polar component was found to be best predicted (r2 0.62) by EP and ALP descriptors. The dispersive component, which represents the strength of instantaneous dipole interactions, was best predicted (r2 0.88) using EP and local average ionization potential descriptors.

Having now predicted the surface energies of the participating species, the second critical length scale leap would be possible if correlations could be drawn between the surface energy components and quantitative descriptors of dispersion and interphase properties. We resort to experiments to obtain data to train separate models to represent these correlations. A chosen set of nanocomposites with a wide range of particle-polymer interactions was studied, including 3 and 8 weight percent samples of Chloropropyldimethylmethoxysilane, Octyldimethylmethoxysilane, and Aminopropyldimethylethoxysilane modified 14 nm colloidal silica nanoparticles embedded in Poly(Methyl Methacrylate) (PMMA), Polystyrene (PS), Poly(Ethyl Methacrylate) (PEMA) or Poly(2-Vinyl Pyridine) (P2VP). Quantitative descriptors of dispersion morphology, average interparticle distance (rd) and average cluster size (rc), were obtained by image analysis of transmission electron micrographs. The parameters rd and rc were further embedded in a volume fraction-dependent 2-point correlation function that describes the dispersion state and can be used to reconstruct statistically equivalent model dispersions.218 Corresponding Tg values and viscoelastic responses could then be obtained from Differential Scanning Calorimetry and Dynamic Mechanical Analysis, respectively. We hypothesize that the equilibrium contact angle of the filler on polymer (θ) and the relative work of adhesion (ΔWa) dictate the dispersion state. Additionally, we hypothesize that the relative attraction of interface polymer to the particle over its attraction to its bulk determines its interfacial mobility. This effect is captured in a term called the work of spreading (Ws), which is the difference between the work of adhesion of polymer to particle and the work of cohesion of polymer to itself.
The equations relating these parameters to the dispersive and polar components have been discussed in detail in earlier work.202, 219, 220 It is simply noted here that dispersion predictions are good for θ = 0° and small values of ΔWa, and worsen significantly for θ > 0°. A positive Ws is expected to cause an increase in Tg, whereas a negative Ws is expected to show a decrease for the same dispersion state. A better dispersion is expected to cause a larger change in Tg for the same Ws due to a higher surface to volume ratio.
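For orientation, the work of spreading can be evaluated from dispersive and polar surface energy components using the standard geometric-mean combining rule. The sketch below uses those textbook expressions as an assumption; the exact equations used in the study are those given in the cited earlier work and may differ in detail. The numerical surface energies are invented for illustration and are taken to be in mJ/m2.

```python
from math import sqrt

def work_of_adhesion(poly_d, poly_p, part_d, part_p):
    """Geometric-mean estimate of the polymer-particle work of adhesion."""
    return 2.0 * (sqrt(poly_d * part_d) + sqrt(poly_p * part_p))

def work_of_cohesion(poly_d, poly_p):
    """Work of cohesion of the polymer: twice its total surface energy."""
    return 2.0 * (poly_d + poly_p)

def work_of_spreading(poly_d, poly_p, part_d, part_p):
    """Work of adhesion minus work of cohesion, as defined in the text."""
    return (work_of_adhesion(poly_d, poly_p, part_d, part_p)
            - work_of_cohesion(poly_d, poly_p))

# Hypothetical polymer with dispersive and polar components of 30 and 5 mJ/m2.
ws_high_energy_particle = work_of_spreading(30.0, 5.0, 40.0, 20.0)
ws_low_energy_particle = work_of_spreading(30.0, 5.0, 20.0, 1.0)
ws_identical_surface = work_of_spreading(30.0, 5.0, 30.0, 5.0)
print(ws_high_energy_particle, ws_low_energy_particle, ws_identical_surface)
```

Under these assumptions, a particle surface with higher energy components than the polymer gives a positive Ws, and a lower-energy surface gives a negative Ws. These are the two cases the hypothesis associates with an increase and a decrease in Tg, respectively.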

Experimental results were found to agree qualitatively with these hypotheses. Image analysis shows large values of rd and rc for composites with θ > 0°, and significantly smaller rd and rc for composites with θ = 0°. Within the θ = 0° group, particles are found to agglomerate more for larger ΔWa values. The sign of the Tg change is found to follow the sign of Ws. A dimensionless parameter, xcorr, is defined as a function of θ and ΔWa, accounting for the nuances of the dispersion hypothesis. rd and rc are correlated to the energetic parameters through empirically determined functions of xcorr. These functions are used to obtain statistically equivalent model reconstructions of the composites from the predicted surface energies. This reconstruction is performed by placing particles in a simulation box and running an annealing algorithm until the specified 2-point correlation function is satisfied.221 After reconstruction, a single system (PS with 3 weight % Chloro-functional-silane-modified-Silica) was chosen as the training system to study the interfacial properties. A two-layer gradient interphase of a chosen thickness is created with a shifted Tg.222, 223 The amount of shifting (Sd) in this interphase layer is determined by matching the FEA simulated ΔTg with that from experimentation, for this single training system. Sd is found, empirically, to be a function of the work of spreading. This function was applied to the interface layers in the other reconstructed systems. The FEA results from the combined workflow show an excellent match between the ΔTg values from simulation and experiments. Even the simulated viscoelastic responses were found to match remarkably well. This example successfully demonstrates a new paradigm in which MQSPR modeling is combined with physics-based methods to predict the thermomechanical properties of polymer nanocomposite materials.
This combination of methods has far-reaching implications for the materials community, and opens a door to similar future collaborations between scientists that enable the virtual design of materials.

5 The Future of Materials Informatics Applications

As we have described in this chapter, a large body of knowledge has been collected concerning the proper application of heuristic, descriptor-driven models by virtue of hard experiences in pharmaceutical applications. When the best-practices methods defined in that domain are applied within the area of materials informatics, similar successes can be achieved. As shown in Section 4, synergistic approaches where MQSPR methods are used in conjunction with physics-based models in a predictive workflow show special promise. As the Materials Genome Initiative continues to evolve, both the availability of high quality data and the need for effective modeling tools will expand exponentially. The Materials Informatics community stands ready to meet this challenge.
