Acta Phys. -Chim. Sin., 2011, Vol. 27, Issue (06): 1407-1416. doi: 10.3866/PKU.WHXB20110608


Prediction of Hepatitis C Virus Non-Structural Proteins 5B Polymerase Inhibitors Using Machine Learning Methods

Lü Wei1, XUE Ying2,3   

    1. College of Life Sciences, State Key Laboratory of Crop Biology, Shandong Agricultural University, Tai′an 271018, Shandong Province, P. R. China;
    2. College of Chemistry, Key Laboratory of Green Chemistry and Technology, Ministry of Education, Sichuan University, Chengdu 610064, P. R. China;
    3. State Key Laboratory of Biotherapy, Sichuan University, Chengdu 610041, P. R. China
  • Received: 2011-03-02 Revised: 2011-03-29 Published: 2011-05-31
  • Contact: XUE Ying
  • Supported by:

    The project was supported by the National Key Basic Research Program of China (2009CB118500) and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, Ministry of Education, China (20071108-18-15).


Non-structural protein 5B (NS5B) is the RNA-dependent RNA polymerase of the hepatitis C virus (HCV) and plays an important role in protein maturation and gene replication. Inhibiting NS5B polymerase prevents RNA replication, which makes it a significant target for the treatment of HCV infection. Computational methods for screening and predicting molecules with NS5B inhibitory activity are therefore becoming increasingly important. This work explores three machine learning (ML) methods, namely support vector machine (SVM), k-nearest neighbor (k-NN), and C4.5 decision tree (C4.5 DT), for the prediction of NS5B inhibitors (NS5BIs). The prediction system was tested using 1248 compounds (552 NS5BIs and 696 non-NS5BIs), which are significantly more diverse in chemical structure than those used in other studies. A feature selection method was used to improve the prediction accuracy and to select the molecular descriptors responsible for distinguishing NS5BIs from non-NS5BIs. Across the three methods, the prediction accuracies were 81.4%-91.7% for the NS5BIs, 78.2%-87.2% for the non-NS5BIs, and 84.1%-85.0% overall. SVM gave the best accuracy for the NS5BIs (91.7%), C4.5 DT gave the best accuracy for the non-NS5BIs (87.2%), and k-NN gave the best overall accuracy (85.0%). This work suggests that machine learning methods can facilitate the prediction of NS5BI potential for unknown sets of compounds and help determine the molecular descriptors associated with NS5BIs.
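
The three accuracies reported above (for NS5BIs, for non-NS5BIs, and overall) are commonly computed from a confusion matrix as sensitivity, specificity, and overall accuracy. The following is a minimal sketch of that arithmetic, assuming the abstract uses these standard definitions; the names `ConfusionCounts`, `count_outcomes`, and `class_accuracies` are hypothetical, and the labels in the usage note are illustrative only, not data from the study.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts for a two-class problem (hypothetical helper type)."""
    tp: int  # inhibitors correctly predicted as inhibitors
    fn: int  # inhibitors wrongly predicted as non-inhibitors
    tn: int  # non-inhibitors correctly predicted as non-inhibitors
    fp: int  # non-inhibitors wrongly predicted as inhibitors


def count_outcomes(y_true: Sequence[int], y_pred: Sequence[int]) -> ConfusionCounts:
    """Tally outcomes; label 1 means inhibitor, label 0 means non-inhibitor."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 0:
            tn += 1
        else:
            fp += 1
    return ConfusionCounts(tp=tp, fn=fn, tn=tn, fp=fp)


def class_accuracies(c: ConfusionCounts) -> Tuple[float, float, float]:
    """Return (accuracy on inhibitors, accuracy on non-inhibitors, overall)."""
    positives = c.tp + c.fn
    negatives = c.tn + c.fp
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    sensitivity = c.tp / positives                       # accuracy on inhibitors
    specificity = c.tn / negatives                       # accuracy on non-inhibitors
    overall = (c.tp + c.tn) / (positives + negatives)    # accuracy on all compounds
    return sensitivity, specificity, overall
```

For example, with made-up labels `y_true = [1, 1, 1, 1, 0, 0, 0, 0]` and `y_pred = [1, 1, 1, 0, 0, 0, 1, 1]`, `class_accuracies(count_outcomes(y_true, y_pred))` gives 0.75 for the inhibitor class, 0.5 for the non-inhibitor class, and 0.625 overall. Reporting the two per-class values separately, as the abstract does, shows whether a classifier favors one class even when the overall figures are close.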

Key words: Machine learning method, Molecular descriptor, Recursive feature elimination, Support vector machine, Hepatitis C virus


  • O641