Acta Physico-Chimica Sinica (物理化学学报), 2011, Vol. 27, Issue 06: 1407-1416. doi: 10.3866/PKU.WHXB20110608

Theoretical and Computational Chemistry

Prediction of Hepatitis C Virus Non-Structural Proteins 5B Polymerase Inhibitors Using Machine Learning Methods

LÜ Wei (吕巍)1, XUE Ying (薛英)2,3

  1. College of Life Sciences, State Key Laboratory of Crop Biology, Shandong Agricultural University, Tai'an 271018, Shandong Province, P. R. China;
  2. College of Chemistry, Key Laboratory of Green Chemistry and Technology, Ministry of Education, Sichuan University, Chengdu 610064, P. R. China;
  3. State Key Laboratory of Biotherapy, Sichuan University, Chengdu 610041, P. R. China
  • Received: 2011-03-02; Revised: 2011-03-29; Published: 2011-05-31
  • Corresponding author: XUE Ying, E-mail: xue@scu.edu.cn
  • Supported by:

    The project was supported by the National Key Basic Research Program of China (2009CB118500) and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, Ministry of Education, China (20071108-18-15).

Abstract:

Non-structural protein 5B (NS5B), an RNA-dependent RNA polymerase, plays an important role in gene replication and protein maturation of the hepatitis C virus (HCV). Inhibiting NS5B polymerase blocks viral RNA replication and is therefore an effective strategy for treating hepatitis C. Screening for and predicting molecules with NS5B inhibitory activity by computational methods is becoming increasingly important. This work explores several machine learning (ML) methods, namely support vector machine (SVM), k-nearest neighbor (k-NN), and C4.5 decision tree (C4.5 DT), for building classification models that distinguish NS5B inhibitors (NS5BIs) from non-inhibitors (non-NS5BIs). The prediction system was tested using 1248 compounds (552 NS5BIs and 696 non-NS5BIs), which are significantly more diverse in chemical structure than those used in other studies. A feature selection method, recursive feature elimination, was used to select the molecular descriptors relevant to distinguishing NS5BIs from non-NS5BIs and thereby improve prediction accuracy. On an independent validation set, the three methods gave prediction accuracies of 81.4%-91.7% for the NS5BIs, 78.2%-87.2% for the non-NS5BIs, and 84.1%-85.0% overall. SVM gave the best accuracy for the NS5BIs (91.7%), C4.5 DT gave the best accuracy for the non-NS5BIs (87.2%), and k-NN gave the best overall accuracy (85.0%). These results suggest that machine learning methods can help predict potential NS5BIs in uncharacterized sets of compounds and identify the molecular descriptors associated with NS5B inhibition.
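
The three accuracies quoted in the abstract (for NS5BIs, for non-NS5BIs, and overall) correspond to what is commonly called sensitivity, specificity, and overall accuracy of a binary classifier. The following is a minimal sketch of how such figures are derived from true and predicted labels. The function names and the toy label lists are hypothetical illustrations, not the authors' code or data.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # inhibitors correctly predicted as inhibitors
    fn: int  # inhibitors wrongly predicted as non-inhibitors
    tn: int  # non-inhibitors correctly predicted as non-inhibitors
    fp: int  # non-inhibitors wrongly predicted as inhibitors


def confusion_from_labels(y_true, y_pred):
    """Count outcomes for labels coded 1 (inhibitor) and 0 (non-inhibitor)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return Confusion(tp, fn, tn, fp)


def accuracies(c):
    """Return (inhibitor accuracy, non-inhibitor accuracy, overall accuracy)."""
    sensitivity = c.tp / (c.tp + c.fn)  # accuracy on the inhibitor class
    specificity = c.tn / (c.tn + c.fp)  # accuracy on the non-inhibitor class
    overall = (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)
    return sensitivity, specificity, overall


# Toy labels, invented for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

conf = confusion_from_labels(y_true, y_pred)
sens, spec, overall = accuracies(conf)
print(f"inhibitors: {sens:.3f}, non-inhibitors: {spec:.3f}, overall: {overall:.3f}")
```

Because overall accuracy is a class-size-weighted combination of the two per-class accuracies, a method can rank best overall without being best on either class, which is consistent with k-NN leading overall while SVM and C4.5 DT each lead on one class.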

Key words: Machine learning method, Molecular descriptor, Recursive feature elimination, Support vector machine, Hepatitis C virus

CLC number: O641