物理化学学报 (Acta Physico-Chimica Sinica) >> 2013, Vol. 29 >> Issue (03): 498-507. doi: 10.3866/PKU.WHXB201301042

Theoretical and Computational Chemistry

High-Dimensional Feature Selection and Peptide QSAR Modeling Based on Ridge Regression and SVM

王志明1,2,3, 韩娜1,2, 袁哲明1,2, 伍朝华3   

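    1 Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128;
    2 Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha 410128;
    3 College of Science, Hunan Agricultural University, Changsha 410128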
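  • Received: 2012-09-24 Revised: 2013-01-02 Published: 2013-02-25
  • Corresponding author: YUAN Zhe-Ming (袁哲明) E-mail: zhmyuan@sina.com
  • Supported by:

    Science Foundation for Distinguished Young Scholars of Hunan Province (10JJ1005) and Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20124320110002)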

Feature Selection for High-Dimensional Data Based on Ridge Regression and SVM and Its Application in Peptide QSAR Modeling

WANG Zhi-Ming1,2,3, HAN Na1,2, YUAN Zhe-Ming1,2, WU Zhao-Hua3   

    1 Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, P. R. China;
    2 Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha 410128, P. R. China;
    3 College of Science, Hunan Agricultural University, Changsha 410128, P. R. China
  • Received: 2012-09-24 Revised: 2013-01-02 Published: 2013-02-25
  • Supported by:

    The project was supported by the Science Foundation for Distinguished Young Scholars of Hunan Province, China (10JJ1005) and Specialized Research Fund for the Doctoral Program of Higher Education, China (20124320110002).

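Abstract (Chinese version, translated):

The absolute values of the weights estimated by ridge regression reflect, to some extent, how strongly the corresponding features act, and on this basis a high-dimensional feature selection algorithm based on ridge regression (RR) and support vector machine (SVM) was developed. For two peptide systems, bitter-tasting dipeptides (BTT) and cytotoxic T lymphocyte (CTL) epitope nonapeptides, peptide structures were characterized directly with 531 physicochemical property parameters of the amino acids, giving 1062 and 4779 initial features, respectively. On the training set, the initial features were ranked by ridge regression and introduced sequentially, stopping when the mean square error (MSE) of SVM leave-one-out cross validation (LOOCV) rose significantly; multiple rounds of last-place elimination then refined the selection further, yielding 7 and 18 retained features, respectively, each with a clear physicochemical meaning. Using the retained features and support vector regression (SVR), quantitative structure-activity relationship (QSAR) models were built on the training sets and used to predict independent test sets; the fitting accuracy, LOOCV accuracy, and independent prediction accuracy were all better than results reported in the literature. The new method runs fast, selects features with clear physicochemical meaning, and is highly interpretable, so it has fairly broad application prospects in high-dimensional regression prediction such as QSAR modeling of peptides and proteins.

Key words: Quantitative structure-activity relationship, Ridge regression, Support vector machine, Feature selection, High-dimensional feature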
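
The initial feature counts quoted in the abstract follow from concatenating one 531-value property vector per residue: 2 × 531 = 1062 for a dipeptide and 9 × 531 = 4779 for a nonapeptide. The sketch below illustrates only that bookkeeping. The property table is filled with random placeholder numbers, the helper names `make_placeholder_table` and `encode_peptide` are hypothetical, and the two sequences are arbitrary examples; none of these come from the paper.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
N_PROPERTIES = 531                    # properties per residue, as stated in the abstract


def make_placeholder_table(seed=0):
    """Random stand-in for a real amino-acid property table (assumption)."""
    rng = random.Random(seed)
    return {aa: [rng.gauss(0.0, 1.0) for _ in range(N_PROPERTIES)]
            for aa in AMINO_ACIDS}


def encode_peptide(sequence, table):
    """Concatenate the property vectors of each residue, in sequence order."""
    features = []
    for residue in sequence:
        features.extend(table[residue])
    return features


table = make_placeholder_table()
dipeptide = encode_peptide("GL", table)            # arbitrary 2-residue example
nonapeptide = encode_peptide("GILGFVFTL", table)   # arbitrary 9-residue example
print(len(dipeptide), len(nonapeptide))  # 1062 4779
```

Because the encoding is position-wise, feature index `i` maps back to residue position `i // 531` and property `i % 531`, which is what would let a retained feature be read as a specific property at a specific position.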

Abstract:

The absolute values of the weights estimated by ridge regression (RR) reflect, to some extent, the importance of the corresponding features. Based on RR and support vector machine (SVM), a new feature selection algorithm for high-dimensional data is proposed. Two peptide systems are used as examples: bitter-tasting dipeptides (BTT) and cytotoxic T lymphocyte (CTL) epitope nonapeptides. Each residue of a peptide was represented by all 531 physicochemical property parameters of its amino acid, giving 1062 and 4779 initial descriptors for BTT and CTL, respectively. Each dataset was divided into a training set and a test set, and weight estimates for all descriptors were generated by RR on the training set. The features were then introduced sequentially in descending order of absolute weight until the mean square error (MSE) of SVM leave-one-out cross validation (LOOCV) increased significantly. The features retained at this stage were refined further through multiple rounds of elimination, leaving 7 and 18 descriptors for BTT and CTL, respectively. A quantitative structure-activity relationship (QSAR) model based on support vector regression (SVR) was then built on the training set with the retained descriptors and used to predict the independent test set. The fitting, LOOCV, and external prediction accuracies were significantly improved with respect to values reported in the literature. Because it runs fast and selects features that have a clear physicochemical meaning and are easy to interpret, the new method is widely applicable to regression prediction on high-dimensional data, such as QSAR modeling of peptides and proteins.

Key words: Quantitative structure-activity relationship, Support vector machine, Ridge regression, Feature selection, High-dimensional feature
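
The three stages described in the abstract (ridge ranking, sequential introduction monitored by SVM LOOCV, then repeated elimination) can be sketched as follows with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define what counts as a "significant" rise in MSE or how each elimination round picks a feature, so the `rise` factor, the `max_features` cap, the drop-the-most-harmful rule, the SVR hyperparameters, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def loocv_mse(X, y):
    """Mean squared error of an RBF-kernel SVR under leave-one-out CV."""
    pred = cross_val_predict(SVR(kernel="rbf", C=10.0), X, y, cv=LeaveOneOut())
    return float(np.mean((y - pred) ** 2))


def rank_by_ridge(X, y, alpha=1.0):
    """Feature indices sorted by descending absolute ridge weight."""
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    return np.argsort(-np.abs(coef))


def forward_stage(X, y, order, rise=1.2, max_features=15):
    """Add ranked features one at a time; stop once LOOCV MSE exceeds
    `rise` times the best value seen so far (assumed stopping rule)."""
    best_mse, best_k = np.inf, 0
    for k in range(1, min(max_features, len(order)) + 1):
        mse = loocv_mse(X[:, order[:k]], y)
        if mse < best_mse:
            best_mse, best_k = mse, k
        elif mse > rise * best_mse:
            break
    return list(order[:best_k]), best_mse


def backward_stage(X, y, selected, mse):
    """Each round drops the feature whose removal lowers LOOCV MSE the most;
    stop when no single removal improves it (assumed elimination rule)."""
    selected = list(selected)
    while len(selected) > 1:
        trials = [(loocv_mse(X[:, [f for f in selected if f != g]], y), g)
                  for g in selected]
        trial_mse, worst = min(trials)
        if trial_mse >= mse:
            break
        selected.remove(worst)
        mse = trial_mse
    return selected, mse


# Synthetic stand-in for a training set: 30 samples, 40 features, 3 informative.
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(30, 40)))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + 0.1 * rng.normal(size=30)

order = rank_by_ridge(X, y)
kept, mse_fwd = forward_stage(X, y, order)
final, mse_final = backward_stage(X, y, kept, mse_fwd)
print(len(kept), len(final), round(mse_final, 3))
```

In this sketch all samples play the role of the training set; in the setting the abstract describes, ranking and both selection stages would see only the training portion, with the independent test set held back for the final SVR prediction.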

CLC number: 

  • O641