My first paper in collaboration with the Capitani team has been published online:

Kumaran Baskaran, Jose M Duarte, Nikhil Biyani, Spencer Bliven and Guido Capitani. (2014)
A PDB-wide, evolution-based assessment of protein–protein interfaces
BMC Structural Biology. 14:22 doi:10.1186/s12900-014-0022-0

It’s been really rewarding working with the team at PSI this year, and it’s great to see our work come to fruition! Here’s the abstract:


Thanks to the growth in sequence and structure databases, more than 50 million sequences are now
available in UniProt and 100,000 structures in the PDB. Rich information about protein–protein
interfaces can be obtained by a comprehensive study of protein contacts in the PDB, their sequence conservation and geometric features.


An automated computational pipeline was developed to run our Evolutionary Protein–Protein
Interface Classifier (EPPIC) software on the entire PDB and store the results in a relational database,
currently containing > 800,000 interfaces. This allows the analysis of interface data on a PDB-wide
scale. Two large benchmark datasets of biological interfaces and crystal contacts, each containing
about 3000 entries, were automatically generated based on criteria thought to be strong indicators of
interface type. The BioMany set of biological interfaces includes NMR dimers solved as crystal
structures and interfaces that are preserved across diverse crystal forms, as catalogued by the Protein
Common Interface Database (ProtCID) from Xu and Dunbrack. The second dataset, XtalMany, is
derived from interfaces that would lead to infinite assemblies and are therefore crystal contacts.
BioMany and XtalMany were used to benchmark the EPPIC approach. The performance of EPPIC
was also compared to classifications from the Protein Interfaces, Surfaces, and Assemblies (PISA)
program on a PDB-wide scale, finding that the two approaches give the same call in about 85% of
PDB interfaces. By comparing our safest predictions to the PDB author annotations, we provide a
lower-bound estimate of the error rate of biological unit annotations in the PDB. Additionally, we
developed a PyMOL plugin for direct download and easy visualization of EPPIC interfaces for any
PDB entry. Both the datasets and the PyMOL plugin are available at


Our computational pipeline allows us to analyze protein–protein contacts and their sequence
conservation across the entire PDB. Two new benchmark datasets are provided, which are over an
order of magnitude larger than existing manually curated ones. These tools enable the
comprehensive study of several aspects of protein–protein contacts in the PDB and represent a basis
for future, even larger scale studies of protein–protein interactions.
