Acta Phys. -Chim. Sin. ›› 2011, Vol. 27 ›› Issue (07): 1654-1660. doi: 10.3866/PKU.WHXB20110735

• THEORETICAL AND COMPUTATIONAL CHEMISTRY •

A Novel Method of Nonlinear Rapid Feature Selection for High Dimensional Data and Its Application in Peptide QSAR Modeling Based on Support Vector Machine

DAI Zhi-Jun1,2, ZHOU Wei1, YUAN Zhe-Ming1,2   

    1. College of Bio-safety Science and Technology, Hunan Agricultural University, Changsha 410128, P. R. China;
    2. Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, P. R. China
  • Received: 2011-03-24 Revised: 2011-05-10 Published: 2011-06-28
  • Contact: YUAN Zhe-Ming E-mail: zhmyuan@sina.com
  • Supported by:

    The project was supported by the Science Foundation for Distinguished Young Scholars of Hunan Province, China (10JJ1005), the Research Fund for the Doctoral Program of Higher Education of China (200805370002), and the Team Project for the Technology Innovation of Higher Education of Hunan Province, China, 2008.

Abstract:

Each amino acid residue of a peptide was characterized directly by 531 physicochemical property parameters. Based on support vector regression (SVR), we developed a new nonlinear rapid feature selection method for high dimensional data and applied it to a quantitative sequence-activity relationship (QSAR) study of two peptide systems (bitter tasting thresholds and angiotensin converting enzyme inhibitors). In both systems, 10 descriptors with clear meaning were retained. We established an SVR model for each peptide system using the retained descriptors. For both models, the accuracies of fitting, leave-one-out cross validation, and external prediction improved significantly compared with results reported in the literature. To enhance the interpretability of the models, significance tests of the nonlinear regression model, a single-factor relative importance analysis, and a single-factor effect analysis were carried out. The new method has broad application prospects for regression forecasting on high dimensional data, such as QSAR modeling of peptides or proteins.
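
As a rough illustration of two ideas mentioned in the abstract, the sketch below shows how a peptide might be encoded by concatenating per-residue property vectors, and how a leave-one-out cross-validated Q² (taken here as 1 − PRESS / total sum of squares, a common convention and an assumption on our part) could be computed. It is not the authors' feature selection method. The function names, the two-value property table, and the activity values are hypothetical toy inputs, and a one-nearest-neighbour regressor stands in for SVR so that the example needs only the Python standard library.

```python
from statistics import mean


def encode_peptide(sequence, property_table):
    """Concatenate the property vector of each residue into one feature vector.

    With 531 properties per residue, a peptide of n residues would yield
    531 * n features; the toy table below uses only 2 per residue.
    """
    features = []
    for residue in sequence:
        features.extend(property_table[residue])
    return features


def loo_q2(X, y, fit_predict):
    """Leave-one-out Q^2 = 1 - PRESS / SS_total (assumed definition).

    fit_predict(X_train, y_train, x) must train on the given samples and
    return a prediction for the single held-out sample x.
    """
    predictions = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        predictions.append(fit_predict(X_train, y_train, X[i]))
    y_mean = mean(y)
    press = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    ss_total = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - press / ss_total


def nearest_neighbour(X_train, y_train, x):
    """Stand-in regressor (NOT SVR): return the target of the closest sample."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(X_train)), key=lambda j: dist2(X_train[j], x))
    return y_train[best]


# Hypothetical toy data: 2 made-up property values per residue, made-up activities.
toy_properties = {"A": [0.1, 1.0], "G": [0.0, 0.5], "L": [0.9, 2.0], "F": [1.0, 2.5]}
toy_peptides = ["AG", "GL", "LF", "FA", "GG"]
toy_activity = [1.2, 2.0, 3.1, 2.4, 0.9]

X = [encode_peptide(p, toy_properties) for p in toy_peptides]
q2 = loo_q2(X, toy_activity, nearest_neighbour)
print(f"LOO Q^2 on toy data: {q2:.3f}")
```

In practice, `nearest_neighbour` would be replaced by a function that trains an SVR model on the training samples and predicts the held-out one; the cross-validation loop itself would stay the same.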

Key words: High dimensional feature, Feature selection, Peptide, Quantitative sequence-activity relationship, Support vector machine