Acta Phys. -Chim. Sin. 2014, Vol. 30, Issue (6): 1091-1098. doi: 10.3866/PKU.WHXB201404091

• THEORETICAL AND COMPUTATIONAL CHEMISTRY •

Predicting the Protein Folding Rate Based on Sequence Feature Screening and Support Vector Regression

LI Yong, ZHOU Wei, DAI Zhi-Jun, CHEN Yuan, WANG Zhi-Ming, YUAN Zhe-Ming   

  1. Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Hunan Agricultural University, Changsha 410128, P. R. China; Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha 410128, P. R. China
  • Received: 2013-12-12 Revised: 2014-04-07 Published: 2014-05-26
  • Contact: YUAN Zhe-Ming E-mail: zhmyuan@sina.com
  • Supported by:

    The project was supported by the Specialized Research Fund for the Doctoral Program of Higher Education, China (20124320110002), National Natural Science Foundation of China (31301388), and Natural Science Foundation of Hunan Province, China (14JJ3092).

Abstract:

Predicting folding rates helps clarify the mechanism of protein folding. In this work, we collected 115 proteins with known folding rates, including two-, multi-, and mixed-state proteins. To characterize the primary structure more comprehensively, we considered sequence length, residue composition at different scales, k-spaced features for residue pairs, and geostatistical association features among residue positions, with each residue replaced by the values of its physicochemical properties. Each protein sequence was thus represented by a numeric vector of 9357 features. Using an improved binary matrix shuffling filter followed by multi-round worst-descriptor elimination, we selected 23 features with clear meaning from this high-dimensional set. We then built a nonlinear support vector regression (SVR) model relating the folding rate to the 23 retained features. The correlation coefficient under jackknife cross-validation was 0.95. This prediction accuracy was higher than that of previously reported results and of the reference feature selection methods. Finally, we established an interpretability system for SVR, which showed that the nonlinear regression relationship between the folding rate and the retained features was highly significant. Further analysis of the effect of each retained descriptor suggested that the protein folding rate may be closely related to sequence length, medium- and short-range association features, triplet-residue composition features, and other descriptors.
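As a rough illustration of two ideas in the abstract, the Python sketch below computes residue composition and k-spaced residue-pair frequencies from a sequence, and runs a jackknife (leave-one-out) evaluation that reports the Pearson correlation between held-out predictions and observed values. It is a minimal sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the exact feature definitions in the paper may differ, and a one-feature least-squares line stands in for the SVR model so that the example needs only the standard library.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def composition(seq):
    """Fraction of each standard residue in seq (20 values)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def k_spaced_pairs(seq, k):
    """Frequencies of residue pairs with k residues between them (400 values).

    Assumed definition: the pair (seq[i], seq[i + k + 1]); k = 0 gives
    adjacent residues. Pairs containing non-standard letters are skipped.
    """
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    return [c / total if total else 0.0 for c in counts.values()]


def featurize(seq, max_k=2):
    """Concatenate length, composition, and k-spaced pairs for k = 0..max_k."""
    feats = [float(len(seq))] + composition(seq)
    for k in range(max_k + 1):
        feats += k_spaced_pairs(seq, k)
    return feats


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def linear_on_first_feature(X_train, y_train, x_test):
    """Stand-in model (NOT SVR): least-squares line on feature 0 only."""
    xs = [row[0] for row in X_train]
    mx, my = sum(xs) / len(xs), sum(y_train) / len(y_train)
    sxx = sum((a - mx) ** 2 for a in xs)
    if sxx == 0:
        return my  # constant feature: fall back to the training mean
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, y_train)) / sxx
    return my + slope * (x_test[0] - mx)


def jackknife_predictions(X, y, fit_predict):
    """Predict each sample with a model fitted on all the other samples."""
    preds = []
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        preds.append(fit_predict(X_train, y_train, X[i]))
    return preds


if __name__ == "__main__":
    demo_seq = "MKVLAAGIVGLLLAQ"  # hypothetical sequence, for illustration only
    print(len(featurize(demo_seq)))  # 1 + 20 + 3 * 400 features
```

To reproduce the kind of evaluation the abstract describes, `linear_on_first_feature` would be replaced by a function that fits a nonlinear SVR on the training rows and predicts the held-out row, and `pearson_r(jackknife_predictions(X, y, model), y)` would then give the jackknife correlation coefficient.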

Key words: Protein folding, Folding rate prediction, High-dimensional feature, Feature screening, Support vector regression

MSC2000: 

  • O641