Acta Physico-Chimica Sinica, 2009, Vol. 25, Issue (08): 1587-1592. doi: 10.3866/PKU.WHXB20090752

Research Article

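QSAR Modeling Based on Geostatistics and Support Vector Regression

CHEN Yuan, YUAN Zhe-Ming, ZHOU Wei, XIONG Xing-Yao

  1. College of Bio-safety Science and Technology, Hunan Agricultural University, Changsha 410128; Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Hunan Agricultural University, Changsha 410128
  • Received: 2009-03-16 Revised: 2009-04-15 Published: 2009-07-16
  • Corresponding author: YUAN Zhe-Ming E-mail: zhmyuan@sina.com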

A Novel QSAR Model Based on Geostatistics and Support Vector Regression

CHEN Yuan, YUAN Zhe-Ming, ZHOU Wei, XIONG Xing-Yao   

  1. College of Bio-safety Science and Technology, Hunan Agricultural University, Changsha 410128, P. R. China; Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Hunan Agricultural University, Changsha 410128, P. R. China
  • Received: 2009-03-16 Revised: 2009-04-15 Published: 2009-07-16
  • Contact: YUAN Zhe-Ming E-mail: zhmyuan@sina.com

Abstract (translated from the Chinese):

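Based on principal component analysis (PCA), geostatistics (GS), and support vector regression (SVR), a new individualized prediction method for quantitative structure-activity relationships (QSAR), Weight-PCA-GS-SVR, is proposed. The basic idea is as follows: PCA first reduces dimensionality and removes redundant information among the independent variables; nonlinear screening of the principal components with SVR then removes those unrelated to the dependent variable; the retained principal components are used to compute weighted distances between samples; and high-dimensional GS determines a common range. Each test sample, taking itself as the center, finds in the training set its own k nearest neighbors whose weighted distances are less than the common range, and an SVR model trained on these neighbors produces the individualized prediction. Weight-PCA-GS-SVR optimizes the model in both the row and column directions, offers a new weighting method for independent variables, offers a new approach to the problem of choosing the optimal k nearest neighbors, and retains the original advantages of SVR. Validated on three compound-activity data sets, the new method achieved the highest prediction accuracy among all reference models and was clearly better than results reported in the literature. Weight-PCA-GS-SVR has fairly broad application prospects in QSAR and other regression prediction fields.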
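The quantities named in the abstract can be written down in a hedged form. The abstract does not give the weighting formula, so the weighted Euclidean distance below, the weights $w_m$, the scores $t_{im}$, and the symbol $a$ for the common range are all assumptions introduced for illustration. The semivariogram is the standard experimental estimator from geostatistics; how the paper extends it to high dimensions is not stated here.

```latex
% Assumed weighted distance between samples i and j over the M retained
% principal components (t_{im}: score of sample i on component m, w_m >= 0).
d_w(i,j) = \sqrt{\sum_{m=1}^{M} w_m \,\bigl(t_{im} - t_{jm}\bigr)^{2}}

% Standard experimental semivariogram of the activity y at lag h,
% where N(h) is the set of sample pairs with d_w(i,j) \approx h.
\hat{\gamma}(h) = \frac{1}{2\,|N(h)|} \sum_{(i,j)\in N(h)} \bigl(y_i - y_j\bigr)^{2}

% The range a is the lag at which \hat{\gamma}(h) levels off. The private
% neighbor set of a test sample x_0 is then taken from the training set T:
\mathcal{N}(x_0) = \bigl\{\, i \in T : d_w(0,i) < a \,\bigr\}
```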

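Key words (translated from the Chinese): Quantitative structure-activity relationship, Geostatistics, Support vector regression, Principal component analysis, Individualized prediction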

Abstract:

Based on principal component analysis (PCA), geostatistics (GS), and support vector regression (SVR), a novel individual forecasting method for quantitative structure-activity relationships (QSAR), Weight-PCA-GS-SVR, was proposed. The basic steps were as follows: first, PCA was used to reduce dimensionality and eliminate redundant information among the independent descriptors; second, principal components unrelated to activity were removed by nonlinear screening with SVR; third, weighted distances between samples were calculated from the retained principal components; fourth, a common range was determined using high-dimensional geostatistics; finally, for each test sample, the k nearest neighbors whose weighted distances were shorter than the common range were selected from the training set, and an SVR model trained on them produced the individual prediction. Weight-PCA-GS-SVR optimized the model along both the column direction (descriptors) and the row direction (samples) while retaining the advantages of SVR. It therefore provides a new way to choose the k nearest neighbors as well as a novel weighting method for the retained principal components or descriptors. Results on three data sets showed that the novel method had the highest prediction precision among all reference models and clearly outperformed previously reported results. Weight-PCA-GS-SVR can therefore be widely used in QSAR and other regression prediction fields.
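The neighbor-selection step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the weighted Euclidean distance, the function names `weighted_distance` and `neighbors_within_range`, the toy scores, the weights, and the value of `common_range` are all hypothetical. In the method itself the common range comes from the geostatistical analysis and the weights from the SVR-based screening, neither of which is reproduced here.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two score vectors (assumed form)."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def neighbors_within_range(query, train_scores, weights, common_range):
    """Indices of training samples closer than common_range, nearest first."""
    hits = []
    for idx, row in enumerate(train_scores):
        d = weighted_distance(query, row, weights)
        if d < common_range:
            hits.append((d, idx))
    hits.sort()  # sort by distance, then by index
    return [idx for _, idx in hits]


# Hypothetical principal-component scores for four training samples.
train_scores = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [4.0, 4.0]]
weights = [1.0, 0.25]   # hypothetical per-component weights
query = [0.0, 0.0]      # hypothetical test sample

local_idx = neighbors_within_range(query, train_scores, weights, common_range=2.0)
print(local_idx)  # prints [0, 1, 2]
```

Each test sample gets its own neighbor list, so the number of neighbors k varies from sample to sample rather than being fixed in advance. A local regression model (SVR in the paper) would then be trained only on the rows in `local_idx`.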

Key words: Quantitative structure-activity relationship, Geostatistics, Support vector regression, Principal component analysis, Individual prediction

CLC Number:

  • O641