Acta Phys.-Chim. Sin. ›› 2013, Vol. 29 ›› Issue (03): 498-507. doi: 10.3866/PKU.WHXB201301042

• THEORETICAL AND COMPUTATIONAL CHEMISTRY •

Feature Selection for High-Dimensional Data Based on Ridge Regression and SVM and Its Application in Peptide QSAR Modeling

WANG Zhi-Ming1,2,3, HAN Na1,2, YUAN Zhe-Ming1,2, WU Zhao-Hua3   

  1 Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, P. R. China;
  2 Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha 410128, P. R. China;
  3 College of Science, Hunan Agricultural University, Changsha 410128, P. R. China
  • Received: 2012-09-24 Revised: 2013-01-02 Published: 2013-02-25
  • Supported by:

    The project was supported by the Science Foundation for Distinguished Young Scholars of Hunan Province, China (10JJ1005) and Specialized Research Fund for the Doctoral Program of Higher Education, China (20124320110002).

Abstract:

The absolute values of the weights estimated from test data by ridge regression (RR) can reflect the significance of the corresponding features. Based on RR and the support vector machine (SVM), a new feature selection algorithm for high-dimensional data is proposed. Examples from bitter tasting thresholds (BTT) and cytotoxic T lymphocyte (CTL) epitopes are presented. Each residue of a peptide was represented by all 531 physicochemical property parameters, giving 1062 and 4779 descriptors for BTT and CTL, respectively. Each dataset was divided into training and test sets, and weight estimates for all descriptors were generated by RR on the training set. Features were then selected one by one in descending order of their weights until the mean square error (MSE) of leave-one-out cross validation (LOOCV) increased significantly. The reserved features were obtained through multiple elimination rounds, each working on the smaller training dataset produced by the previous round. The new method selected 7 and 18 descriptors for BTT and CTL, respectively. A quantitative structure-activity relationship (QSAR) model based on support vector regression (SVR) was established on the extracted data using only the reserved descriptors, and was then used to predict the test data. The fitting, LOOCV, and external prediction accuracies were significantly improved relative to values reported in the literature. Because of its fast computation, clear physicochemical meaning, and ease of interpretation, the new method is widely applicable to regression forecasting on high-dimensional data, such as QSAR modeling of peptides or proteins.
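
The ranking-and-selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: ridge regression stands in for SVR when computing the LOOCV error, the multiple elimination rounds are not reproduced, and the data are synthetic. The names `ridge_weights`, `loocv_mse`, and `select_features`, the penalty `lam`, and the 5% tolerance `tol` used to decide that the MSE "increased significantly" are all assumptions made for illustration.

```python
import random

def ridge_weights(X, y, lam=1.0):
    """Solve (X^T X + lam*I) w = X^T y by Gaussian elimination (no intercept)."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X^T X + lam*I | X^T y].
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) + (lam if i == j else 0.0)
          for j in range(p)] + [sum(X[k][i] * y[k] for k in range(n))]
         for i in range(p)]
    for col in range(p):
        # Partial pivoting, then eliminate entries below the pivot.
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= f * A[col][c]
    # Back substitution.
    w = [0.0] * p
    for i in reversed(range(p)):
        w[i] = (A[i][p] - sum(A[i][j] * w[j] for j in range(i + 1, p))) / A[i][i]
    return w

def loocv_mse(X, y, cols, lam=1.0):
    """Leave-one-out MSE using only the columns in `cols` (ridge as a stand-in for SVR)."""
    errs = []
    for i in range(len(X)):
        Xtr = [[row[c] for c in cols] for k, row in enumerate(X) if k != i]
        ytr = [v for k, v in enumerate(y) if k != i]
        w = ridge_weights(Xtr, ytr, lam)
        pred = sum(wj * X[i][c] for wj, c in zip(w, cols))
        errs.append((y[i] - pred) ** 2)
    return sum(errs) / len(errs)

def select_features(X, y, lam=1.0, tol=0.05):
    """Add features in descending |weight| order until LOOCV MSE rises by more than tol."""
    w = ridge_weights(X, y, lam)
    order = sorted(range(len(w)), key=lambda j: -abs(w[j]))
    chosen, best = [], float("inf")
    for j in order:
        mse = loocv_mse(X, y, chosen + [j], lam)
        if chosen and mse > best * (1 + tol):
            break  # hypothetical stand-in for "increased significantly"
        chosen.append(j)
        best = min(best, mse)
    return chosen

# Synthetic training data (assumption): only features 2 and 7 carry signal.
random.seed(0)
n, p = 30, 12
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3 * row[2] - 2 * row[7] + random.gauss(0, 0.1) for row in X]

chosen = select_features(X, y)
print("selected feature indices:", chosen)
```

In this toy setting the two informative features receive the largest absolute ridge weights and are therefore picked first. With real descriptors, the columns would normally be standardized beforehand so that the weights are comparable across features.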

Key words: Quantitative structure-activity relationship, Support vector machine, Ridge regression, Feature selection, High-dimensional feature