For our weekly journal club I talked about a new method for de novo protein structure prediction called EVfold. [Slides] The details are in the paper (plus 15 pages of supporting text):

Marks, D. S., Colwell, L. J., Sheridan, R., Hopf, T. A., Pagnani, A., Zecchina, R., & Sander, C. (2011). Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE, 6(12), e28766. doi:10.1371/journal.pone.0028766

The authors are motivated by two observations:

“In spite of significant progress in the field of structural genomics over the last decade [20], only about half of all well-characterized protein families (PFAM-A, 12,000 families), have a 3D structure for any of their members [1].”

“As we are about to reach a truly explosive phase of massively parallel sequencing, we anticipate increased coverage of sequence space for protein families by several orders of magnitude, well above the level of 1000–10000 non-redundant sequences for protein family and with rich evolutionary information about protein structure directly from sequence.”

Basically, DNA sequencing is dirt cheap and will only get cheaper, but up until now this hasn’t been helping to solve protein structures.

Marks et al. try to remedy this situation by looking at co-evolving residue pairs. Basically, they hypothesize that residues which are located close together in 3D space will tend to evolve together. If one mutates to a smaller residue, the other will tend to mutate to something bigger to compensate. If one changes from positively charged to negative, the other will change from negative to positive to balance it out. The idea behind EVfold is to identify co-evolving residues from the thousands of sequences we have for some protein families, then use that information to provide distance constraints in order to predict the protein’s structure.
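To make "co-evolving" concrete, here is a toy sketch of the simplest co-variation score, plain mutual information between two alignment columns. This is not what EVfold itself uses; the alignment and the function name are made up for illustration.

```python
from collections import Counter
from math import log

def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two alignment columns.

    Each column is a string with one residue per sequence.
    """
    if len(col_i) != len(col_j):
        raise ValueError("columns must cover the same sequences")
    n = len(col_i)
    f_i = Counter(col_i)             # single-column residue counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))  # joint counts for the pair of columns
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi

# Hypothetical toy alignment: positions 0 and 2 swap charge together
# (K = lysine, positive; E = glutamate, negative); position 1 varies independently.
alignment = ["KAEL", "KAEL", "EAKL", "EAKV", "KGEV", "EGKL"]
columns = ["".join(seq[k] for seq in alignment) for k in range(len(alignment[0]))]

coupled = mutual_information(columns[0], columns[2])      # charge-swapping pair
independent = mutual_information(columns[0], columns[1])  # unrelated pair
```

In this toy case the charge-swapping pair scores ln 2 and the unrelated pair scores zero, which is the kind of signal the method relies on.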

Of course, just because two residues co-vary doesn’t necessarily imply they are spatially close. They could influence each other indirectly, for instance if both bind the same ligand or both contact some intermediate residue. So the authors use a technique called direct coupling analysis (DCA) to predict which residues are close together. This has been around for a few years (Weigt et al. (2009). PNAS, 106(1), 67–72), although that’s not immediately clear from the paper. DCA assigns a quantity called direct information (DI) to each pair of residues, which correlates really well with whether the pair is close together.
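For reference, this is roughly how I understand DI to relate to ordinary mutual information (MI), following Weigt et al. The notation is my own shorthand, so treat it as a sketch rather than the paper's exact formulation.

```latex
\begin{align}
\mathrm{MI}_{ij} &= \sum_{A,B} f_{ij}(A,B)\,\ln\frac{f_{ij}(A,B)}{f_i(A)\,f_j(B)} \\
\mathrm{DI}_{ij} &= \sum_{A,B} P^{\mathrm{dir}}_{ij}(A,B)\,\ln\frac{P^{\mathrm{dir}}_{ij}(A,B)}{f_i(A)\,f_j(B)} \\
P^{\mathrm{dir}}_{ij}(A,B) &= \frac{1}{Z_{ij}}\exp\!\big(e_{ij}(A,B) + \tilde{h}_i(A) + \tilde{h}_j(B)\big)
\end{align}
```

Here f_i(A) is the frequency of amino acid A in alignment column i and f_ij(A,B) is the observed pair frequency. DI swaps the observed pair frequency for a two-site model that keeps only the direct coupling e_ij inferred from a global fit to the whole alignment, with the fields h̃ chosen so that the model reproduces the single-column frequencies. Correlations that arise only through a third residue should therefore contribute little to DI.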

Marks et al. figure S2c. Grey regions indicate residues of Ras protein which are close together in the crystal structure, while red dots indicate pairs which were predicted to be close based on DI.

EVfold takes the top-ranked residue pairs and assumes they are close together. It then uses those pairs as distance constraints to solve the structure. This is essentially the same as using distance constraints from NMR to solve a structure, and uses well-known simulated annealing/molecular dynamics algorithms. At the end, you get lovely protein structures with 3-5Å RMSD from the crystal structure.
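The "take the top-ranked pairs" step might look something like the sketch below. The function names, the minimum sequence separation, and the 7 Å upper bound are my own illustrative assumptions, not values taken from the paper.

```python
def top_contacts(scores, n_pairs, min_separation=5):
    """Return the n_pairs highest-scoring residue pairs as (i, j, score).

    scores: dict mapping (i, j) residue index pairs to a coupling score (e.g. DI).
    min_separation: assumed filter that skips pairs close in sequence,
    since those are trivially close in space.
    """
    candidates = [
        (i, j, score)
        for (i, j), score in scores.items()
        if abs(i - j) > min_separation
    ]
    candidates.sort(key=lambda pair: pair[2], reverse=True)
    return candidates[:n_pairs]

def to_restraints(contacts, max_distance=7.0):
    """Turn predicted contacts into upper-bound distance restraints (in Å).

    max_distance is an assumed cutoff chosen for illustration.
    """
    return [
        {"i": i, "j": j, "upper_bound_angstrom": max_distance}
        for i, j, _ in contacts
    ]

# Hypothetical scores for four residue pairs.
example_scores = {(1, 2): 0.9, (3, 40): 0.5, (10, 30): 0.7, (5, 50): 0.1}
contacts = top_contacts(example_scores, n_pairs=2)
restraints = to_restraints(contacts)
```

The pair (1, 2) has the highest score but is dropped because the residues are adjacent in sequence. The resulting restraints would then be handed to whatever simulated annealing or molecular dynamics package does the actual folding.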

Marks et al. figure 2. Predicted (left) and observed (right) structures for three proteins. A few minor differences are visible, such as missing beta-strands, but all three predictions are correct overall.

Perhaps the most impressive fact about this is that EVfold is able to predict a structure in less than an hour from only sequence information. That is incredible compared with the days of supercomputer time needed for other ab initio methods like ROSETTA.

So has EVfold solved the structure prediction problem? Hardly. There are many proteins where finding 1000+ homologous sequences will be hard, even with advances in sequencing technology (vertebrate-only proteins, for instance). Also, the authors suggest that even with perfect distance constraints the simulated annealing methods will not be able to predict structures to better than 2Å. So major advances in refining structures are needed before the crystallographers will be out of a job.

Still, there are lots of applications for which 3-5Å models of widespread folds would be useful. For instance, one of the major difficulties I’ve run into in my work on fold space is that we know there are thousands of proteins which are dissimilar to all known structures. Do these represent new folds, or are they just more variants of existing known folds? The speed of EVfold means that it should be fairly easy to predict structures for all of these domains which have enough sequence information out there. That’s not as good as having experimentally determined structures for everything, but it could give us some intriguing insights into the completeness of protein fold space.
