CFCS Youth Talks

Machine Learning in Protein Structure Prediction

  • Sheng Wang, King Abdullah University of Science and Technology
  • Time: 2018-04-02 16:10
  • Host: CFCS
  • Venue: Room 101, Courtyard No.5, Jingyuan


Recently, significant progress has been made in ab initio protein folding using restraints derived from predicted 2D contacts and predicted 1D secondary structure. However, the reliability of the folded 3D structure depends heavily on accurate contact and secondary structure prediction, which existing methods achieve only for large protein families with thousands of sequence homologs. Here we employ deep learning, a powerful technique from computer science that can learn complex patterns from large datasets. In particular, our approach to contact prediction differs from existing methods mainly in (1) formulating contact prediction as a pixel-level image labeling problem rather than an image-level classification problem; (2) predicting all contacts of an individual protein simultaneously to make effective use of contact occurrence patterns; and (3) integrating one-dimensional and two-dimensional deep convolutional neural networks to learn complex sequence-structure relationships, including high-order residue correlations. The 1D deep network can also be applied directly to secondary structure prediction, where it achieved state-of-the-art accuracy of ~84%, breaking the ~80% accuracy ceiling that had stood for decades. Our contact prediction method performed best in CASP12 in terms of F1 score on 38 free-modeling targets. Since CASP12, we have been testing our method in CAMEO, a fully automated online blind test, in which we successfully predicted ab initio the structures of 10 proteins with novel folds. Finally, we demonstrated that a deep transfer learning method can be readily applied to predict membrane protein structures.
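
To make the "pixel-level labeling" idea concrete, here is a minimal sketch, not the speaker's actual network. It assumes per-residue features are already available (in the real method these would come from a deep 1D convolutional network), converts them into a pairwise L x L "image", and assigns a contact probability to every residue pair at once. The function names `pairwise_features` and `contact_map`, the random features, and the single linear layer standing in for a deep 2D convolutional network are all hypothetical choices made for illustration.

```python
import numpy as np

def pairwise_features(seq_feats):
    """Turn per-residue features (L, d) into pairwise features (L, L, 2d)."""
    L, d = seq_feats.shape
    rows = np.broadcast_to(seq_feats[:, None, :], (L, L, d))  # features of residue i
    cols = np.broadcast_to(seq_feats[None, :, :], (L, L, d))  # features of residue j
    return np.concatenate([rows, cols], axis=-1)

def contact_map(seq_feats, weights, bias=0.0):
    """Label every residue pair (pixel) at once with a contact probability.

    A single linear layer plus sigmoid stands in for a trained 2D network.
    """
    pair = pairwise_features(seq_feats)       # (L, L, 2d)
    logits = pair @ weights + bias            # (L, L): one score per pixel
    probs = 1.0 / (1.0 + np.exp(-logits))     # sigmoid
    return 0.5 * (probs + probs.T)            # contacts are symmetric in (i, j)

# Toy input: random features and weights (assumptions, not real data).
rng = np.random.default_rng(0)
L, d = 6, 4
feats = rng.normal(size=(L, d))
w = rng.normal(size=2 * d)
probs = contact_map(feats, w)
print(probs.shape)
```

The output is one probability per residue pair rather than a single label for the whole protein, which is what distinguishes pixel-level labeling from image-level classification.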


I am now a Research Scientist at King Abdullah University of Science and Technology (KAUST). Previously, I was a joint Research Professional in the Department of Human Genetics at the University of Chicago and at the Toyota Technological Institute at Chicago. I obtained my Ph.D. under the supervision of Wei-Mou Zheng at the Institute of Theoretical Physics, Chinese Academy of Sciences, and my Bachelor's degree from the School of Life Sciences and Biotechnology, Shanghai Jiao Tong University.


My research spans a pipeline that starts with machine learning models, proceeds through computational biology algorithms, and ends with applications to biological problems. Specifically, I focus on the analysis and interpretation of DNA, RNA, and protein data. In recent years, I have studied models and algorithms for learning from large, imbalanced datasets and applied them to predicting protein structure and function. My current interests also include machine-learning-based next-generation sequencing and structure-based protein function analysis.