Why are proteins marginally stable?
D. Taverna and R. A. Goldstein
Proteins, in press.
Most globular proteins are marginally stable regardless of size or activity. The most common interpretation is that proteins must be marginally stable in order to function, and so marginal stability represents the results of positive selection. We consider the issue of marginal stability directly using model proteins and the dynamical aspects of protein evolution in populations. We find that the marginal stability of proteins is an inherent property of proteins due to the high dimensionality of the sequence space, without regard to protein function. In this way, marginal stability can result from neutral, non-adaptive evolution. By allowing protein sub-populations with different stability requirements for functionality to compete, we find that marginally stable populations of proteins tend to dominate. Our results show that functionalities consistent with marginal stability have a strong evolutionary advantage, and might arise because of the natural tendency of proteins towards marginal stability.
The distribution of indel lengths
B. Qian and R. A. Goldstein
Proteins, 45 (2001), 102-104.
Protein sequence alignment has become a widely used method in the study of newly sequenced proteins. Most sequence alignment methods use an affine gap penalty to assign scores to insertions and deletions. While affine gap penalties represent the relative ease of extending a gap compared with initiating one, they remain an obvious oversimplification of the real processes that occur during sequence evolution. In order to improve the efficiency of sequence alignment methods and to obtain a better understanding of the process of sequence evolution, we wish to find a more accurate model of insertions and deletions in homologous proteins. In this work, we extract the probability of a gap occurrence and the resulting gap length distribution in distantly related proteins (sequence identity less than 25%) using alignments based on their common structures. We observe a distribution of gaps that can be fitted with a multi-exponential with four distinct components. The results suggest new approaches to modeling insertions and deletions in sequence alignments.
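As a rough illustration of the contrast drawn in this abstract, the sketch below places a generic affine gap penalty next to a generic multi-exponential length distribution. The function names, the default penalty values, and the weights and decay lengths are hypothetical placeholders, not the fitted values reported in the paper.

```python
import math


def affine_gap_penalty(length, gap_open=10.0, gap_extend=1.0):
    """Penalty for a gap of `length` residues under an affine model.

    gap_open and gap_extend are illustrative defaults, not values from the paper.
    """
    if length <= 0:
        return 0.0
    # One opening cost, then a constant cost for every additional residue.
    return gap_open + gap_extend * (length - 1)


def multi_exponential(length, weights, decays):
    """Relative frequency of a gap of `length` as a sum of exponential components.

    weights and decays are hypothetical fit parameters (one pair per component).
    """
    return sum(w * math.exp(-length / d) for w, d in zip(weights, decays))
```

An affine penalty grows linearly with gap length, which corresponds to a single exponential decay of gap frequency; a sum of several exponentials allows long gaps to be more frequent than any single component would predict.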
Evolution of functionality in lattice proteins
P. D. Williams, D. D. Pollock, and R. A. Goldstein
Journal of Molecular Graphics and Modelling 19 (2001), 150-156.
We study the evolution of protein functionality using a 2-dimensional lattice model. We find that characteristics particular to evolution, such as population dynamics and the early evolutionary trajectories, have a large effect on the distribution of observed structures. We find little difference between the distribution of structures evolved for function and those evolved for their ability to form compact structures.
Optimization of a new score function for
the detection of remote homologs
M. Kann, B. Qian, and R.A. Goldstein
Proteins 41 (2000), 498-503.
The growth in protein sequence data has placed a premium on ways to infer structure and function of the newly sequenced proteins. One of the most effective ways is to identify a homologous relationship with a protein about which more is known. While close evolutionary relationships can be confidently determined with standard methods, the difficulty increases as the relationships become more distant. All of these methods rely on some score function to measure sequence similarity. The choice of score function is especially critical for these distant relationships. We describe a new method of determining a score function, optimizing the ability to discriminate between homologs and non-homologs. We find that this new score function performs better than standard score functions for the identification of distant homologies.
How to Generate Improved Potentials for Protein Tertiary
Structure Prediction: A Lattice Model Study
T.-L. Chiu and R.A. Goldstein
Proteins 41 (2000), 157-163.
Success in the protein structure prediction problem relies heavily on the choice of an accurate potential function. One approach towards extracting these potentials from a database of known protein structures is to maximize the Z-score of the database proteins, maximizing the ability of the potential to discriminate correct from random conformations. These optimization methods have an unfortunate tendency to underestimate the repulsive interactions, leading to reduced accuracy and predictive ability. Using a lattice model, we show how this tendency is due to the Gaussian form assumed for the energies of the ensemble of random structures, and show how we can weight the distribution to suppress the high-energy state contribution to the Z-score calculation. The result is a potential that is more accurate and more likely to yield correct predictions than other Z-score optimization methods as well as potentials of mean force.
Optimizing for Success: A New Score Function For Distantly Related
Protein Sequence Comparison
M. Kann and R.A. Goldstein
RECOMB 2000, in press.
The exponential growth of the sequence data produced by the genome projects motivates the development of better ways of inferring structural and functional information about newly sequenced proteins. Looking for similarities between these probe protein sequences and other protein sequences in the database has proved to be one of the most useful current techniques. This procedure, known as sequence comparison, relies on the use of an appropriate score function that discriminates homologs from non-homologs. Current score functions have difficulty identifying distantly related homologs with low sequence similarity. As a result, there is an increased demand for a new score function that yields statistically significant higher scores for all pairs of homologous protein sequences, including such distantly related homologs. We present a new method for generating a score function by optimizing it for successful discrimination between homologous and unrelated proteins. The performance of the new score function (OPTIMA) on a set of distantly related protein sequences was compared with that of other commonly used substitution matrices obtained with different methods. OPTIMA performs better than Dayhoff's PAM250, structurally derived matrices (JTT), the Gonnet et al. matrix (GONN, an improvement of the PAM series), and the widely used BLOSUM 62 (BL62). This improvement could have a large impact on the identification of homologs among the increasing number of protein sequences.
Modeling Evolution at the Protein Level using an Adjustable Amino
Acid Fitness Model.
M.W. Dimmic, D. P. Mindell, and R.A. Goldstein
Pacific Symposium on Biocomputing 2000.
An adjustable fitness model for amino acid site substitutions is investigated. This model, a generalization of previously developed evolutionary models, has several distinguishing characteristics: it separately accounts for the processes of mutation and substitution, allows for heterogeneity among substitution rates and among evolutionary constraints, and does not make any prior assumptions about which sites or characteristics of proteins are important to molecular evolution. While the model has fewer adjustable parameters than the general reversible mtREV model, when optimized it outperforms mtREV in likelihood analysis on protein-coding mitochondrial genes. In addition, the optimized fitness parameters of the model show correspondence to some biophysical characteristics of amino acids.
The Evolution of Duplicated Genes Considering Protein Stability
Constraints
D.M. Taverna and R.A. Goldstein
Pacific Symposium on Biocomputing 2000.
We model the evolution of duplicated genes by assuming that the gene's protein message, if transcribed and translated, must form a stable, folded structure. We observe the change in protein structure over time in an evolving population of lattice model proteins. We find that selection of stable proteins conserves the original structure if the structure is highly designable, that is, if a large fraction of all foldable sequences form that structure. This effect implies that the relative number of pseudogenes can be less than previously predicted with neutral evolution models. The data also suggest a reason for lower than expected ratios of non-synonymous to synonymous substitutions in pseudogenes.
Surveying Determinants of Protein Structure Designability across
Different Energy Models and Amino-Acid Alphabets: A Consensus
N.E.G. Buchler and R.A. Goldstein
Journal of Chemical Physics 112 (2000), 2533-2547.
A variety of analytical and computational models have been proposed to answer the question of why some protein structures are more ``designable'' (i.e. have more sequences folding into them) than others. One class of analytical and statistical-mechanical models has approached the designability problem from a thermodynamic viewpoint. These models highlighted specific structural features important for increased designability. Furthermore, designability was shown to be inherently related to thermodynamically-relevant energetic measures of protein folding such as the foldability F and the energy gap. However, many of these models have been studied within a very narrow focus: namely, pair-contact interactions and two-letter amino-acid alphabets. Recently, two-letter amino-acid alphabets for pair-contact models have been shown to contain designability artifacts which disappear for larger-letter amino-acid alphabets. In addition, a solvation model was demonstrated to give identical designability results to previous two-letter amino-acid alphabet pair-contact models. In light of these discordant results, this report synthesizes a broad consensus regarding the relationship between specific structural features, foldability F, energy gap, and structure designability for different energy models (pair-contact vs. solvation) across a wide range of amino-acid alphabets. We also propose a novel measure Z which is shown to be well correlated to designability. Finally, we conclusively demonstrate that two-letter amino-acid alphabets for pair-contact models appear to be solvation models in disguise.
The Distribution of Structures in Evolving Protein Populations
D.M. Taverna and R.A. Goldstein
Biopolymers 53 (2000), 1-8.
Proteins exhibit a non-uniform distribution of structures. A number of models have been advanced to explain this observation by considering the distribution of designabilities, that is, the fraction of all sequences that could successfully fold into any particular native structure. In this paper, we describe how population dynamics can make the distribution of observed protein structures more uneven than the distribution of designabilities. Additional factors, such as the topology of the sequence space and the similarity of other structures, can also influence this distribution.
Universal Correlation between Energy Gap and Foldability for the
Random Energy Model and Lattice Proteins
N.E.G. Buchler and R.A. Goldstein
Journal of Chemical Physics 111 (1999), 6599-6609.
The Random Energy Model, originally used to analyze the physics of spin glasses, has been employed to explore what makes a protein a good folder versus a bad folder. In earlier work, the ratio of the folding temperature over the glass-transition temperature was related to a statistical measure of protein energy landscapes denoted as the foldability F. It was posited and subsequently established by simulation that good folders had larger foldabilities, on average, than bad folders. An alternative hypothesis, equally verified by protein folding simulations, was that it is the energy gap between the native state and the next highest energy that distinguishes good folders from bad folders. This duality of measures has led to some controversy and confusion with little done to reconcile the two. In this paper, we revisit the Random Energy Model to derive the statistical distributions of the various energy gaps and foldability. The resulting joint distribution allows us to explicitly demonstrate the positive correlation between foldability and energy gap. In addition, we compare the results of this analytical theory with a variety of lattice models. Our simulations indicate that both the individual distributions and the joint distribution of foldability and energy gap agree qualitatively well with the Random Energy Model. It is argued that the universal distribution of and the positive correlation between foldability and energy gap, both in lattice proteins and the REM, is simply a stochastic consequence of the ``Thermodynamic Hypothesis''.
Estimating the Total Number of Protein Folds
S. Govindarajan, R. Recabarren, and R.A. Goldstein
Proteins 35 (1999), 408-414.
Many seemingly unrelated protein families share common folds. Theoretical models based on structure designability have suggested that a few folds should be very common while many others have low probability. In agreement with the predictions of these models, we show that the distribution of observed protein families over different folds can be modeled with a highly-stretched exponential. Our results suggest that there are approximately 4000 possible folds, some so unlikely that only 2000 folds exist among naturally occurring proteins. Due to the large number of extremely rare folds, constructing a comprehensive database of all existent folds would be difficult. Constructing a database of the most-likely folds representing the vast majority of protein families would be considerably easier.
Using Physical-Chemistry Based Mutation Models in Phylogenetic Analyses
of HIV-1 Subtypes
J.M. Koshi, D.P. Mindell, and R.A. Goldstein
Molecular Biology and Evolution 16 (1999), 173-179.
HIV-1 subtype phylogeny is investigated using a previously-developed computational model of natural amino acid site mutations. This model, based on Boltzmann statistics and Metropolis kinetics, involves an order of magnitude fewer adjustable parameters than traditional mutation matrices and deals more effectively with the issue of protein site-heterogeneity. After training on sequences of HIV-1 envelope (env) proteins from a few specific subtypes, our model is more likely to describe the evolutionary record for other subtypes than methods using a single mutation matrix, even a matrix optimized over the same data. Pairwise distances are calculated between various probabilistic ancestral subtype sequences, and a distance matrix approach is used to find the optimal phylogenetic tree. Our results indicate that the relationship between subtypes B, C, and D may be closer than previously thought.
Effect of Alphabet Size and Foldability Requirements on Protein Structure
Designability
N.E.G. Buchler and R.A. Goldstein
Proteins 34 (1999), 113-124.
A number of investigators have addressed the issue of why certain protein structures are especially common by considering structure "designability", defined as the number of sequences that would successfully fold into any particular native structure. One such approach, based on "foldability", suggested that structures could be classified according to their maximum possible foldability and that this optimal foldability would be highly correlated with structure designability. Other approaches have focused on computing the designability of lattice proteins written with reduced two-letter amino-acid alphabets. These different approaches suggested contrasting characteristics of the most designable structures. This report compares the designability of lattice proteins over a wide range of amino-acid alphabets and foldability requirements. While all alphabets have a wide distribution of protein designabilities, the form of the distribution depends on how protein "viability" is defined. Furthermore, under increasing foldability requirements, the change in designabilities for all alphabets is in good agreement with the previous conclusions of the foldability approach. Most importantly, we found that those structures which were highly designable for the two-letter amino-acid alphabets are not especially designable with higher-letter alphabets.
Optimizing Potentials for the Inverse Protein Folding Problem
T.-L. Chiu and R.A. Goldstein
Protein Engineering 11 (1998), 749-752.
Inverse protein folding, which seeks to identify sequences that fold into a given structure, has been approached by threading candidate sequences onto the structure and scoring them with database-derived potentials. The sequences with the lowest energies are predicted to fold into that structure. It has been argued that the limited success of this type of approach is not due to the discrepancy between the scoring potential and the true potential but is rather due to the fact that sequences choose their lowest-energy structure rather than structures choosing the lowest-energy sequences. Here we develop a non-physical potential scheme optimized for the inverse folding problem. We maximize the average probability of success for a set of lattice proteins to obtain the optimal potential energy function, and show that the potential obtained by our method is more likely to produce successful predictions than the true potential.
Optimizing Energy Potentials for Success in Protein Tertiary Structure
Prediction
T.-L. Chiu and R.A. Goldstein
Folding & Design 3 (1998), 223-228.
Success in the protein structure prediction problem relies on the choice of an accurate potential function. For a single protein sequence, Wolynes and co-workers showed that the potential function can be optimized for predictive success by maximizing the energy gap between the correct structure and the ensemble of random structures relative to the distribution of the energies of these random structures (Z-score). Different ways have been described of implementing this procedure for an ensemble of database proteins. Here we demonstrate a new approach to carrying out this task. For a single protein sequence, the probability of success (i.e. the probability that the folded state is the lowest energy state) is derived. We then maximize the average probability of success for a set of proteins to obtain the optimal potential energy function. This results in maximum attention being focused on those proteins whose structures are difficult but not impossible to predict. Using a lattice model of proteins, we show that the optimal interaction potentials obtained by our method are both more accurate and more likely to produce successful predictions than those obtained by other averaging procedures.
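The Z-score mentioned in this abstract can be illustrated with a short sketch. The sign convention (a larger Z means the correct structure lies further below the random ensemble) and the use of the population standard deviation are assumptions made for this example; the function name and the energies are hypothetical.

```python
import statistics


def z_score(native_energy, random_energies):
    """Gap between the mean random-structure energy and the native energy,
    in units of the standard deviation of the random-structure energies.

    Assumed convention: lower energy is more favorable, so a well-separated
    native state gives a large positive Z.
    """
    mean = statistics.fmean(random_energies)
    spread = statistics.pstdev(random_energies)
    return (mean - native_energy) / spread


# Hypothetical energies for one sequence threaded onto decoy structures.
example_z = z_score(-10.0, [-2.0, 0.0, 2.0])
```

The abstract's own method goes a step further than this quantity: rather than averaging Z-scores over a database, it maximizes the average probability that the folded state is the lowest-energy state.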
On the Thermodynamic Hypothesis of Protein Folding
S. Govindarajan and R.A. Goldstein
Proc. Nat'l Acad. Sci (USA) 95 (1998), 5545-5549.
The validity of the thermodynamic hypothesis of protein folding was explored by simulating the evolution of protein sequences. Simple models of lattice proteins were allowed to evolve by random point mutations subject to the constraint that they fold into a pre-determined native structure using a Monte Carlo folding algorithm. We employed a simple analytical approach to compute the probability of violation of the thermodynamic hypothesis as a function of the size of the protein, the fraction of the total number of possible conformations which are kinetically accessible, and the roughness of the free-energy landscape. It was found that even if the folding is under kinetic control, the sequence will evolve so that the native state is most often the state of minimum free energy.
Models of Natural Mutations Including Site Heterogeneity
J.M. Koshi and R.A. Goldstein
Proteins 32 (1998), 289-295.
New computational models of natural site mutations are developed that account for the different selective pressure acting on different locations in the protein. The number of adjustable parameters is greatly reduced by basing the models on the underlying physical-chemical properties of the amino acids. This allows us to use our method on small data sets built of specific protein types. We demonstrate that with this approach we can represent the evolutionary patterns in HIV envelope proteins far better than with more traditional methods.
Beyond Mutation Matrices: Physical-chemistry Based Evolutionary Models
J.M. Koshi, D.P. Mindell, and R.A. Goldstein
Genome Informatics 1997 (1998), refereed conference proceeding.
We describe a model for characterizing site mutations in evolving proteins. By representing the fitness of each of the amino acids as a function of the physical-chemical properties of that amino acid, and constructing mutation matrices based on Boltzmann statistics and Metropolis kinetics, we are able to greatly reduce the number of adjustable parameters. This allows us to include site heterogeneity in the model, as well as to optimize the model for specific protein types. We demonstrate the applicability of the model by investigating the phylogenetic relationship between various subtypes of HIV-1.
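One way to picture a mutation matrix built from Boltzmann statistics and Metropolis kinetics is sketched below. The functional form is an assumption for illustration only: each amino acid gets a dimensionless fitness, equilibrium frequencies are proportional to exp(fitness), and a proposed substitution is accepted with the Metropolis probability. The names, the uniform proposal scheme, and the fitness values are hypothetical and are not taken from the paper.

```python
import math


def boltzmann_frequencies(fitness):
    """Equilibrium frequencies proportional to exp(fitness) (assumed form)."""
    weights = [math.exp(f) for f in fitness]
    total = sum(weights)
    return [w / total for w in weights]


def metropolis_matrix(fitness):
    """Row-stochastic substitution matrix with Metropolis acceptance.

    Every other residue type is proposed with equal probability, and a
    change from i to j is accepted with probability
    min(1, exp(fitness[j] - fitness[i])).
    """
    n = len(fitness)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                accept = min(1.0, math.exp(fitness[j] - fitness[i]))
                matrix[i][j] = accept / (n - 1)
        # The remaining probability is the chance that the site is unchanged.
        matrix[i][i] = 1.0 - sum(matrix[i])
    return matrix


# Hypothetical fitness values for a three-letter toy alphabet.
toy_fitness = [0.0, 1.0, -0.5]
toy_matrix = metropolis_matrix(toy_fitness)
toy_freqs = boltzmann_frequencies(toy_fitness)
```

With this construction the whole matrix is determined by one fitness value per residue type, which is the sense in which such models need far fewer adjustable parameters than a general substitution matrix.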
Evolution of Model Proteins on a Foldability Landscape
S. Govindarajan and R.A. Goldstein
Proteins 29 (1997), 461-466.
We model the evolution of simple lattice proteins as a random walk in a fitness landscape, where the fitness represents the ability of the protein to fold. At higher selective pressure the evolutionary trajectories are confined to neutral networks where the native structure is conserved and the dynamics are non self-averaging and non-exponential. The optimizability of the corresponding native structure has a strong effect on the size of these neutral networks, and thus on the nature of the evolutionary process.
Site Mutations in Model Proteins
S. Govindarajan and R.A. Goldstein
Mathematical Modelling and Scientific Computing (1997).
Model proteins can be used to understand the process of site mutations. We simulate the evolution of lattice proteins, requiring that every sequence during the evolutionary trajectory be sufficiently able to fold. We can then study what mutations are accepted, and how these relative mutation rates depend upon surface accessibility. We measure the degree of conservation of the mutation by how much it affects the intramolecular interactions that determine the native structure and the foldability. We find that although substitutions in the interior of the protein are more conservative than those on the protein exterior in terms of substituting similar amino acids, the changes in the interactions are comparable in these two different cases. The advantages of the interaction landscape approach are discussed.
The Foldability Landscape of Model Proteins
S. Govindarajan and R.A. Goldstein
Biopolymers 42 (1997), 427-438.
Molecular evolution may be considered as a walk in a multi-dimensional fitness landscape, where the fitness at each point is associated with features such as the function, stability and survivability of these molecules. We present a simple model for the evolution of protein sequences on a landscape with a precisely defined fitness function. We use simple lattice models to represent protein structures, with the ability of a protein sequence to fold into the structure with lowest energy, quantified as the foldability, representing the fitness of the sequence. The foldability of the sequence is characterized based on the spin glass model of protein folding. We consider evolution as a walk in this foldability landscape and study the nature of the landscape and the dynamics on such a landscape. Selective pressure is explicitly included in this model in the form of a minimum foldability requirement. We find that different native structures are not evenly distributed in interaction space, with similar structures and structures with similar optimal foldabilities clustered together. Evolving proteins marginally fulfill the selective criteria of foldability. As the selective pressure is increased, evolutionary trajectories become increasingly confined to ``neutral networks'', where the sequence and the interactions can be significantly changed while a constant structure is maintained.
Predicting Protein Secondary Structure Using Probabilistic Substitution
Schemata
M.J. Thompson and R.A. Goldstein
Protein Science 6 (1997), 1963-1975.
We demonstrate the applicability of our previously developed Bayesian probabilistic approach for predicting residue solvent accessibility to the problem of predicting secondary structure. Using only single sequence data, this method achieves a 3-state accuracy of 67% over a database of 473 non-homologous proteins. This approach is more amenable to inspection and less likely to overlearn specifics of a dataset than ``black box'' methods such as neural networks. It is also conceptually simpler and less computationally costly. We also introduce a novel method for representing and incorporating multiple sequence alignment information within the prediction algorithm, achieving 72% accuracy over a dataset of 304 non-homologous proteins. This is accomplished by creating a statistical model of the evolutionarily-derived correlations between patterns of amino acid substitution and local protein structure. This model consists of parameter vectors, termed ``substitution schemata'', which probabilistically encode the structure-based heterogeneity in the distributions of amino acid substitutions found in alignments of homologous proteins. The model is optimized for structure prediction by maximizing the mutual information between the set of schemata and the database of secondary structures. Unlike ``expert heuristic'' methods, this approach has been demonstrated to work well over large datasets. Unlike the opaque neural network algorithms, this approach is physicochemically intelligible. Moreover, the model optimization procedure, the formalism for predicting one-dimensional structural features, and our previously developed method for tertiary structure recognition all share a common Bayesian probabilistic basis. This consistency starkly contrasts with the hybrid and ad hoc nature of methods which have dominated this field in recent years.
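The mutual information that this abstract uses as its optimization target can be computed from a table of joint counts, as in the minimal sketch below. The function name and the layout of the table (rows as schema classes, columns as secondary-structure states) are hypothetical choices for illustration, and the optimization over schemata described in the abstract is not shown.

```python
import math


def mutual_information(joint_counts):
    """Mutual information in bits between the row and column variables
    of a table of joint counts (a list of equal-length rows)."""
    total = sum(sum(row) for row in joint_counts)
    row_totals = [sum(row) for row in joint_counts]
    col_totals = [sum(col) for col in zip(*joint_counts)]
    info = 0.0
    for i, row in enumerate(joint_counts):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing
                p_joint = count / total
                p_indep = (row_totals[i] / total) * (col_totals[j] / total)
                info += p_joint * math.log2(p_joint / p_indep)
    return info
```

A table in which each class always co-occurs with a single structural state gives the maximum information, while a table whose rows are proportional to one another gives zero.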
Compaction and Folding in Model Proteins
T.-L. Chiu and R.A. Goldstein
Journal of Chemical Physics 107 (1997), 4408-4415.
Protein folding is modeled as diffusion on a free-energy landscape, allowing use of the diffusion equation to study the impact of energetic parameters on the folding dynamics. The free-energy landscape is characterized by two different order parameters, one representing the degree of compactness, the other a measure of the progress towards the folded state. For marginally stable proteins, fastest folding is achieved when the non-specific interactions favoring compaction are strong, resulting in a high folding temperature. Such proteins fold by rapid collapse followed by slower accumulation of correct contacts.
Protein Heteronuclear NMR Assignments using Mean-Field Simulated
Annealing
N.E.G. Buchler, E.R.P. Zuiderweg, H. Wang, and R.A. Goldstein
Journal of Magnetic Resonance 125 (1997), 34-42.
A computational method for the assignment of the NMR spectra of larger (21 kDa) proteins using a set of six of the most sensitive heteronuclear multidimensional nuclear magnetic resonance experiments is described. Connectivity data obtained from HNCa, HN(CO)Ca, HN(Ca)Ha, and Ha(CaCO)NH and spin-system identification data obtained from CP-(H)CCH-N TOCSY and CP-(H)C(CaCO)NH TOCSY were used to perform sequence-specific assignments using a mean-field formalism and simulated annealing. This mean-field method reports the resonance assignments in a probabilistic fashion, displaying the certainty of assignments in an unambiguous and quantitative manner. This technique was applied to the NMR data of the 172-residue peptide-binding domain of the E. coli heat-shock protein, DnaK. The method is demonstrated to be robust to significant amounts of missing, spurious, noisy, extraneous, and erroneous data.
Mutation Matrices and Physical-Chemical Properties: Correlations
and Implications
J.M. Koshi and R.A. Goldstein
Proteins 27 (1996), 336-344.
In order to better understand how the properties of individual amino acids result in proteins with particular structures and functions, we have examined the correlations between previously-derived structure-dependent mutation rates and changes in various physical-chemical properties of the amino acids such as volume, charge, α-helical and β-sheet propensity, and hydrophobicity. In most cases we found the ΔG of transfer from octanol to water to be the best model for evolutionary constraints, in contrast to the much weaker correlation with the ΔG of transfer from cyclohexane to water, a property found highly correlated to changes in stability in site-directed mutagenesis studies. This suggests that natural evolution may follow different rules than those suggested by results obtained in the laboratory. A high degree of conservation of a surface residue's relative hydrophobicity was also observed, a fact which cannot be explained based on constraints on protein stability, but may reflect the consequences of the reverse-hydrophobic effect. Local propensity, especially α-helical propensity, is rather poorly conserved during evolution, indicating that non-local interactions dominate protein structure formation. We found that changes in volume were important in specific cases, most significantly in transitions among the hydrophobic residues in buried locations. To demonstrate how these techniques could be used to understand particular protein families, we derived and analyzed mutation matrices for the hypervariable and framework regions of antibody light chain V regions. We found a surprisingly high conservation of hydrophobicity in the hypervariable region, possibly indicating an important role for hydrophobicity in antigen recognition.
Probabilistic Reconstruction of Ancestral Protein Sequences
J.M. Koshi and R.A. Goldstein
Journal of Molecular Evolution 42 (1996), 413-420.
Using a maximum likelihood formalism, we have developed a method to reconstruct the sequences of ancestral proteins. Our approach allows the calculation of not only the most probable ancestral sequence, but also the probability of each amino acid at any given node in the evolutionary tree. Because we consider evolution on the amino acid level, we are better able to include effects of evolutionary pressure, and take advantage of structural information about the protein through the use of mutation matrices that depend on secondary structure and surface accessibility. The computational complexity of this method scales linearly with the number of homologous proteins used to reconstruct the ancestral sequence.
Why are some Protein Structures so Common?
S. Govindarajan and R.A. Goldstein
Proc. Nat'l Acad. Sci (USA) 93 (1996), 3341-3345.
Many biological proteins are observed to fold into one of a limited number of structural motifs. By considering the requirements imposed on proteins by their need to fold rapidly, and the ease with which such requirements can be fulfilled as a function of the native structure, we can explain why certain structures are repeatedly observed among proteins with negligible sequence similarity. This work has implications for the understanding of protein sequence-structure relationships as well as protein evolution.
Predicting Solvent Accessibility: Higher Accuracy Using Bayesian
Statistics and Optimized Residue Substitution Classes
M.J. Thompson and R.A. Goldstein
Proteins 25 (1996), 38-47.
We introduce a novel Bayesian probabilistic method for predicting the solvent accessibilities of amino acid residues in globular proteins. Using single sequence data this method achieves prediction accuracies higher than previously published methods. Substantially improved predictions--comparable to the highest accuracies reported in the literature to date--are obtained by representing alignments of the example proteins and their homologs as strings of residue substitution classes depending on the side chain types observed at each alignment position. These results demonstrate the applicability of this relatively simple Bayesian approach to structure prediction and illustrate the utility of the classification methodology previously developed to extract information from aligned sets of structurally related proteins.
Constructing Amino Acid Residue Substitution Classes Maximally Indicative
of Local Protein Structure
M.J. Thompson and R.A. Goldstein
Proteins 25 (1996), 28-37.
Using an information theoretic formalism, we optimize classes of amino acid substitution to be maximally indicative of local protein structure. Our statistically-derived classes are loosely identifiable with the heuristic constructions found in previously published work. However, while these other methods provide a more rigid idealization of physicochemically-constrained residue substitution, our classes provide substantially more structural information with many fewer parameters. Moreover, these substitution classes are consistent with the paradigmatic view of the sequence-to-structure relationship in globular proteins which holds that the three-dimensional architecture is predominantly determined by the arrangement of hydrophobic and polar side chains with weak constraint on the actual amino acid identities. More specific constraints are imposed on the placement of prolines, glycines and the charged residues. These substitution classes have been used in highly accurate predictions of residue solvent accessibility. They could also be used in the identification of homologous proteins, the construction and refinement of multiple sequence alignments, and as a means of condensing and codifying the information in multiple sequence alignments for secondary structure prediction and tertiary fold recognition.
Correlating Structure-Dependent Mutation Matrices with Physical-Chemical
Properties
J.M. Koshi and R.A. Goldstein
Pacific Symposium on Biocomputing '96 (L. Hunter and T. E. Klein, eds.) (1995), 488-499.
We have investigated how structure-dependent mutation matrices derived in previous work correlate with various physical-chemical properties of the 20 naturally occurring amino acids. Among the properties we investigated were ΔG of transfer from water to octanol and cyclohexane, α-helical and β-sheet propensity, size, and charge. We found that the ΔG of transfer to octanol had a high correlation with matrices for all categories of residues, especially the matrices for buried and exposed positions. This result suggests that octanol is a good model for understanding both the changes in stability resulting from substitutions of buried residues and changes in foldability resulting from varying exposed residues. We also found that the correlations of the matrices with size and charge varied with the local environment, and that neither α-helical nor β-sheet propensity had high correlations with most matrices. Thus, conservation of size and charge appears to be important in specific environments, and conservation of α-helix and β-sheet propensity does not seem to be a key factor.
Optimal Local Propensities for Model Proteins
S. Govindarajan and R.A. Goldstein
Proteins 22 (1995), 413-418.
Lattice models of proteins were used to examine the role of local propensities in stabilizing the native state of a protein, using techniques drawn from spin-glass theory to characterize the free-energy landscapes. In the strong evolutionary limit, optimal conditions for folding are achieved when the contribution of local interactions to the stability of the native state is small. Further increasing the local interactions rapidly decreases the foldability.
Context-Dependent Optimal Substitution Matrices Derived Using Bayesian
Statistics and Phylogenetic Trees
J.M. Koshi and R.A. Goldstein
Protein Engineering 8 (1995), 641-645.
Substitution matrices are a key tool in important applications such as identifying sequence homologies, creating sequence alignments, and, more recently, using evolutionary patterns to predict protein structure. We have developed a novel approach to deriving these matrices that uses not only multiple sequence alignments but also the associated evolutionary trees. The key to our method is the use of a Bayesian formalism to calculate the probability that a given substitution matrix fits the tree structures and multiple sequence alignment data. With this ability, we can determine optimal substitution matrices for various local environments, depending upon parameters such as secondary structure and surface accessibility.
Searching for Foldable Protein Structures using Optimized Energy
Functions
S. Govindarajan and R.A. Goldstein
Biopolymers 36 (1995), 43-51.
During evolution, the effective interactions between residues
in a protein can be adjusted through mutations to allow the protein to
fold to its native structure on an adequate time-scale. We seek to address
the question, are there some structures that can be better optimized than
others? Using exhaustive enumeration of the compact conformations of short
proteins confined to simple lattices, we find that the best structures
are those that contain contacts rare in random structures, indicating the
importance of non-local contacts for assisting the folding process. Certain
structural motifs such as long β-hairpins, Greek-key motifs, and jelly rolls, commonly found in
proteins of known structure, have a high degree of optimizability.
Contrary to what might be expected, positive correlations between the
various interactions reduce optimizability. The optimization procedure
produces a correlated energy landscape, which might assist folding.
Optimized Energy Functions for Tertiary Structure Prediction and
Recognition
R.A. Goldstein, Z.A. Luthey-Schulten, and P.G. Wolynes
Protein Structure by Distance Analysis (1994), 135-144.
A theoretical basis for the alignment of a protein sequence to a set of protein structure templates is presented, based on a Bayesian statistical analysis. The optimal Hamiltonian for this threading is closely related to the Hamiltonian optimized for molecular dynamics based on spin-glass theory. The Bayesian theory provides the optimal penalty functions for insertions and deletions in the alignment, which can be put in the form of a chemical potential. In contrast to standard methods for determining gap penalties, these penalties involve the logarithm of the probability distribution of gaps in alignments against correct templates as compared to the probability distribution of gaps in alignments against random templates, as determined self-consistently. Sequences of unknown proteins can be aligned to known protein structures, identifying similar structural motifs and generating reasonably correct alignments.
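The log-ratio form of the gap penalty described above can be sketched as follows. This is a minimal illustration, not the published procedure. The sign convention (penalty = -log of the ratio), the additive pseudocount, the function name `log_odds_gap_penalties`, and the example gap lengths are all assumptions. The self-consistent determination mentioned in the abstract is omitted.

```python
import math
from collections import Counter


def log_odds_gap_penalties(correct_lengths, random_lengths, pseudocount=1.0):
    """Return {gap_length: -log(P_correct(length) / P_random(length))}.

    correct_lengths: gap lengths seen when aligning against correct templates.
    random_lengths:  gap lengths seen when aligning against random templates.
    The pseudocount keeps unseen lengths from giving a zero probability.
    """
    max_len = max(list(correct_lengths) + list(random_lengths))
    lengths = range(max_len + 1)
    correct = Counter(correct_lengths)
    random_ = Counter(random_lengths)
    n_correct = len(correct_lengths) + pseudocount * len(lengths)
    n_random = len(random_lengths) + pseudocount * len(lengths)
    penalties = {}
    for length in lengths:
        p_correct = (correct[length] + pseudocount) / n_correct
        p_random = (random_[length] + pseudocount) / n_random
        penalties[length] = -math.log(p_correct / p_random)
    return penalties


# Hypothetical gap-length observations, for illustration only.
correct_gaps = [0, 0, 0, 0, 1, 1, 2]
random_gaps = [0, 1, 2, 3, 3, 4, 5]
penalties = log_odds_gap_penalties(correct_gaps, random_gaps)
print(penalties)
```

Under this convention, gap lengths that are more frequent against correct templates than against random ones receive a negative penalty (a bonus), and lengths typical of random alignments are penalized.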
3-Dimensional Model for the Hormone-Binding Domains of Steroid-Receptors
R.A. Goldstein, J.A. Katzenellenbogen, Z.A. Luthey-Schulten, D.A. Seielstad,
and P.G. Wolynes
Proc. Nat'l Acad. Sci (USA) 90 (1993), 9949-9953.
We have used a motif-based structural search method to identify
structural homologs of the hormone binding domains of the nuclear receptors
from among a set of known protein structures and have found the closest
similarity with members of the subtilisin-like serine proteases. These
proteins consist of an open twisted sheet of parallel β-strands flanked on both sides by α-helices.
The alignment with the protease scaffold was refined by using multiple
sequence prealignment of different sets of nuclear receptors, and alternative
model structures were screened by considering their consistency with the
results of biochemical experiments defining the ligand binding pocket.
In the most favored model, nearly all of the residues thought to be involved
in ligand binding map to a pocket of appropriate dimensions where the subtilisin-like
proteases have their active site. The three-dimensional model that we propose
for the hormone binding domains of the nuclear receptors provides a framework
for the design of experiments to further investigate nuclear receptor structure
and function.
Protein Tertiary Structure Prediction using Optimized Hamiltonians
with Local Interactions
R.A. Goldstein, Z.A. Luthey-Schulten, and P.G. Wolynes
Proc. Nat'l Acad. Sci (USA) 89 (1992), 9029-9033.
Protein folding codes embodying local interactions including surface and secondary structure propensities and residue-residue contacts are optimized for a set of training proteins by using spin-glass theory. A screening method based on these codes correctly matches the structure of a set of test proteins with proteins of similar topology with 100% accuracy, even with limited sequence similarity between the test proteins and the structural homologs and the absence of any structurally similar proteins in the training set.
Optimal Protein-Folding Codes from Spin-Glass Theory
R.A. Goldstein, Z.A. Luthey-Schulten, and P.G. Wolynes
Proc. Nat'l Acad. Sci (USA) 89 (1992), 4918-4922.
Protein-folding codes embodied in sequence-dependent energy functions
can be optimized using spin-glass theory. Optimal folding codes for associative-memory
Hamiltonians based on aligned sequences are deduced. A screening method
based on these codes correctly recognizes protein structures in the "twilight
zone" of sequence identity in the overwhelming majority of cases. Simulated
annealing for the optimally encoded Hamiltonian generally leads to qualitatively
correct structures.