Acta Physico-Chimica Sinica >> 2013, Vol. 29 >> Issue (08): 1639-1647. doi: 10.3866/PKU.WHXB201305171

Theoretical and Computational Chemistry

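Quantitative Structure-Activity Relationship Study of Non-Nucleoside Inhibitors of Hepatitis C Virus NS5B Polymerase Based on Machine Learning Methods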

CONG Yong1, XUE Ying1,2

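  1. 1 College of Chemistry, Key Laboratory of Green Chemistry and Technology, Ministry of Education, Sichuan University, Chengdu 610064;
    2 Key Laboratory of Advanced Scientific Computation of Sichuan Province, Xihua University, Chengdu 610039
  • Received:2013-02-01 Revised:2013-05-17 Published:2013-07-09
  • Corresponding author: XUE Ying E-mail:yxue@scu.edu.cn
  • Supported by:

    Project supported by the National Natural Science Foundation of China (21173151) and the Open Fund of the Provincial Key Laboratory of Advanced Scientific Computation, Xihua University (szjj2011-029)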

Quantitative Structure-Activity Relationship Study of the Non-Nucleoside Inhibitors of HCV NS5B Polymerase by Machine Learning Methods

CONG Yong1, XUE Ying1,2   

  1. 1 College of Chemistry, Key Laboratory of Green Chemistry and Technology, Ministry of Education, Sichuan University, Chengdu 610064, P. R. China;
    2 State Key Laboratory of Biotherapy, Sichuan University, Chengdu 610041, P. R. China
  • Received:2013-02-01 Revised:2013-05-17 Published:2013-07-09
  • Contact: XUE Ying E-mail:yxue@scu.edu.cn
  • Supported by:

    The project was supported by the National Natural Science Foundation of China (21173151) and Open Research Fund of Key Laboratory of Advanced Scientific Computation, Xihua University, China (szjj2011-029).

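Abstract:

A quantitative structure-activity relationship (QSAR) study was carried out on 89 benzoisothiazole and benzothiazine non-nucleoside inhibitors of hepatitis C virus (HCV) NS5B polymerase. Two feature selection methods, genetic algorithm-partial least squares (GA-PLS) and linear stepwise regression analysis (LSRA), were used to select optimal descriptor subsets, and multiple linear regression and partial least squares linear regression models were then built. In addition, the genetic algorithm-support vector machine (GA-SVM) method was applied for the first time to build nonlinear support vector regression models on the descriptor subsets chosen by each of the two selection methods. The models built with all three machine learning methods gave fairly satisfactory predictions. The three QSAR models built with the 6 LSRA-selected descriptors gave test-set correlation coefficients of 0.958-0.962, with GA-SVM giving the best prediction accuracy (0.962). The three QSAR models built with the 7 GA-PLS-selected descriptors gave test-set correlation coefficients of 0.918-0.960, with the partial least squares regression model performing best (0.960). This work provides an effective method for predicting the biological activity of HCV inhibitors, and the method can also be extended to other similar QSAR studies.

Key words: HCV NS5B polymerase, Non-nucleoside inhibitor, Linear stepwise regression analysis, Partial least squares, Genetic algorithm, Support vector machine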

Abstract:

The quantitative structure-activity relationship (QSAR) approach was used to predict the activity of 89 non-nucleoside inhibitors of hepatitis C virus (HCV) NS5B polymerase built on two scaffolds, benzoisothiazole and benzothiazine. Two feature selection methods, linear stepwise regression analysis (LSRA) and genetic algorithm-partial least squares (GA-PLS), were used to select descriptor subsets, from which multiple linear regression (MLR) and partial least squares (PLS) linear models were built. The genetic algorithm-support vector machine (GA-SVM) approach was applied for the first time to build nonlinear support vector regression models with the six LSRA-selected and the seven GA-PLS-selected descriptors. The three QSAR models built with the six LSRA-selected descriptors gave correlation coefficients of 0.958-0.962 for the test set, with GA-SVM giving the highest value (0.962). The three QSAR models built with the seven GA-PLS-selected descriptors gave correlation coefficients of 0.918-0.960 for the test set, of which the PLS model was the best (0.960). All of the models gave satisfactory predictions, and the approach can be extended to other QSAR studies.

Key words: HCV NS5B polymerase, Non-nucleoside inhibitor, Linear stepwise regression analysis, Partial least squares, Genetic algorithm, Support vector machine
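
The abstract describes selecting a descriptor subset with a genetic algorithm and then judging the resulting model by its correlation coefficient on a held-out test set. The following is a minimal sketch of that general workflow, not the authors' implementation. It makes several assumptions: the data are synthetic (89 random "compounds" with 20 made-up descriptors), the fitness function is a 5-fold cross-validated error of an ordinary least-squares model rather than PLS or SVM, and every name and parameter (`run_ga`, `fitness`, population size, mutation rate, the 67/22 split) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not the paper's data set):
# 89 "compounds", 20 candidate descriptors, activity driven by 3 of them.
n_compounds, n_descriptors = 89, 20
X = rng.normal(size=(n_compounds, n_descriptors))
y = (2.0 * X[:, 0] - 1.5 * X[:, 3] + 1.0 * X[:, 7]
     + rng.normal(scale=0.3, size=n_compounds))

# Hold out a test set that the selection step never sees.
idx = rng.permutation(n_compounds)
train, test = idx[:67], idx[67:]
folds = np.array_split(train, 5)  # fixed folds for cross-validation


def fit_mlr(X_sub, y_sub):
    """Least-squares fit with an intercept; returns [b0, b1, ...]."""
    A = np.column_stack([np.ones(len(X_sub)), X_sub])
    coef, *_ = np.linalg.lstsq(A, y_sub, rcond=None)
    return coef


def predict_mlr(coef, X_sub):
    return coef[0] + X_sub @ coef[1:]


def fitness(mask):
    """Negative 5-fold CV error, lightly penalising large subsets."""
    if not mask.any():
        return -np.inf
    errs = []
    for k, val in enumerate(folds):
        fit = np.concatenate([f for j, f in enumerate(folds) if j != k])
        coef = fit_mlr(X[fit][:, mask], y[fit])
        pred = predict_mlr(coef, X[val][:, mask])
        errs.append(np.mean((pred - y[val]) ** 2))
    return -float(np.mean(errs)) - 0.01 * mask.sum()


def run_ga(pop_size=30, generations=40, p_mut=0.05):
    """Tiny GA over boolean descriptor masks with elitism."""
    pop = rng.random((pop_size, n_descriptors)) < 0.5
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        children = [pop[np.argmax(scores)].copy()]  # keep the best mask
        while len(children) < pop_size:
            parents = []
            for _ in range(2):  # binary tournament selection
                a, b = rng.integers(pop_size, size=2)
                parents.append(pop[a] if scores[a] >= scores[b] else pop[b])
            cross = rng.random(n_descriptors) < 0.5  # uniform crossover
            child = np.where(cross, parents[0], parents[1])
            child ^= rng.random(n_descriptors) < p_mut  # bit-flip mutation
            children.append(child)
        pop = np.array(children)
    scores = np.array([fitness(m) for m in pop])
    return pop[np.argmax(scores)]


best_mask = run_ga()
coef = fit_mlr(X[train][:, best_mask], y[train])
pred_test = predict_mlr(coef, X[test][:, best_mask])
r_test = float(np.corrcoef(pred_test, y[test])[0, 1])

print("selected descriptors:", np.flatnonzero(best_mask))
print("test-set correlation coefficient: %.3f" % r_test)
```

Replacing `fit_mlr` and `predict_mlr` with a PLS or support vector regressor would bring the sketch closer to the GA-PLS and GA-SVM variants named in the abstract. The cross-validated fitness and the untouched test set are the parts that matter for an honest estimate of predictive ability.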