Acta Physico-Chimica Sinica >> 2011, Vol. 27 >> Issue (07): 1654-1660. doi: 10.3866/PKU.WHXB20110735

Theoretical and Computational Chemistry

A Novel Method of Nonlinear Rapid Feature Selection for High Dimensional Data and Its Application in Peptide QSAR Modeling Based on Support Vector Machine

DAI Zhi-Jun¹,², ZHOU Wei¹, YUAN Zhe-Ming¹,²

  1. College of Bio-safety Science and Technology, Hunan Agricultural University, Changsha 410128, P. R. China;
  2. Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Hunan Agricultural University, Changsha 410128, P. R. China
  • Received: 2011-03-24 Revised: 2011-05-10 Published: 2011-06-28
  • Contact: YUAN Zhe-Ming E-mail: zhmyuan@sina.com
  • Supported by:

    The project was supported by the Science Foundation for Distinguished Young Scholars of Hunan Province, China (10JJ1005), the Research Fund for the Doctoral Program of Higher Education of China (200805370002), and the 2008 Team Project for the Technology Innovation of Higher Education of Hunan Province, China.

Abstract:

Peptide structure was characterized directly by 531 physicochemical property parameters for each amino acid residue. Based on support vector regression (SVR), we developed a new nonlinear rapid feature selection method for high-dimensional data and applied it to quantitative sequence-activity relationship (QSAR) modeling of two peptide systems: bitter-tasting dipeptides and angiotensin-converting enzyme inhibitors. In each system, screening retained 10 descriptors with clear meaning. SVR models built on the retained descriptors showed substantially higher fitting accuracy, leave-one-out cross-validation accuracy, and external prediction accuracy than results reported in the literature. To enhance the interpretability of the models, we carried out a significance test of the nonlinear regression, a significance test of single-factor relative importance, and a single-factor effect analysis. The new method has broad application prospects for regression prediction on high-dimensional data, such as QSAR modeling of peptides and proteins.

Key words: High-dimensional features, Feature selection, Peptide, Quantitative sequence-activity relationship, Support vector machine
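
To make the descriptor construction and the leave-one-out measure mentioned in the abstract concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the authors' procedure: the abstract does not specify the feature selection algorithm, so none is reproduced here. The property table `TOY_PROPS`, the peptide list, and the activity values are made up, and the regressor is a simple nearest-neighbour stand-in rather than SVR. The Q² formula used (1 - PRESS / total sum of squares about the overall mean) is one common convention and may differ from the one used in the paper.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
# A "fit" function takes training rows and targets and returns a predictor.
FitFn = Callable[[Sequence[Vector], Sequence[float]], Callable[[Vector], float]]


def encode_peptide(seq: str, props: Dict[str, Vector]) -> Vector:
    """Concatenate the per-residue property vectors along the sequence."""
    features: Vector = []
    for residue in seq:
        features.extend(props[residue])
    return features


def fit_nearest_neighbour(X, y):
    """Stand-in regressor (NOT SVR): return the target of the closest training row."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in X]
        return y[dists.index(min(dists))]
    return predict


def leave_one_out_q2(X, y, fit: FitFn) -> float:
    """Q^2 = 1 - PRESS / SS_total; each sample is predicted by a model that never saw it."""
    preds = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        preds.append(fit(X_train, y_train)(X[i]))
    mean_y = sum(y) / len(y)
    press = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_total


# Hypothetical toy table: 2 made-up values per residue instead of 531.
TOY_PROPS = {"G": [0.0, 0.1], "A": [0.3, 0.2], "L": [1.0, 0.9], "F": [1.2, 1.0]}
peptides = ["GG", "GA", "AL", "LL", "LF", "FF"]
activity = [0.1, 0.2, 0.9, 1.6, 1.9, 2.2]  # made-up activity values

X = [encode_peptide(p, TOY_PROPS) for p in peptides]
q2 = leave_one_out_q2(X, activity, fit_nearest_neighbour)
```

With 531 parameters per residue, the same encoding gives a dipeptide 2 × 531 = 1062 candidate descriptors, which is why a screening step is needed before modeling. An SVR implementation could be plugged in through the `fit` argument without changing the cross-validation loop.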