Acta Physico-Chimica Sinica >> 2010, Vol. 26 >> Issue (12): 3351-3359. doi: 10.3866/PKU.WHXB20101128

Biophysical Chemistry

Classification Models for Acetylcholinesterase Inhibitors Based on Machine Learning Methods

YANG Guo-Bing², LI Ze-Rong¹, RAO Han-Bing¹, LI Xiang-Yuan², CHEN Yu-Zong³

  1. College of Chemistry, Sichuan University, Chengdu 610064, P. R. China;
  2. College of Chemical Engineering, Sichuan University, Chengdu 610065, P. R. China;
  3. Department of Pharmacy, National University of Singapore, Singapore 117543
  • Received: 2010-07-22 Revised: 2010-08-15 Published: 2010-12-01
  • Contact: LI Ze-Rong E-mail: lizerong@scu.edu.cn
  • Supported by:

    The project was supported by the National Natural Science Foundation of China (20973118).

Abstract:

A total of 1559 molecular descriptors, covering constitutional, charge distribution, topological, geometrical, and physicochemical properties, were calculated to encode acetylcholinesterase inhibitors. A subset of 37 descriptors was selected using a hybrid filter/wrapper approach that combines Fisher score ranking with Monte Carlo simulated annealing. Classification models for the acetylcholinesterase inhibitors were then built with support vector machine (SVM), artificial neural network (ANN), and k-nearest neighbor (k-NN) methods. For the 515 samples in the training set, 5-fold cross validation gave average prediction accuracies of 87.3%-92.7%, 67.0%-81.0%, and 79.4%-88.2% for the positive, the negative, and the total samples, respectively. To check the SVM model for chance correlation, the y-scrambling method was applied; it gave average prediction accuracies of 72.7%-82.5%, 41.0%-53.0%, and 62.1%-69.1% for the positive, the negative, and the total samples, respectively. These values are clearly lower than those of the actual models, indicating that there was no chance correlation in our models. An external test on 172 independent samples that were not used for model building gave prediction accuracies of 93.3%-100.0%, 74.6%-89.6%, and 86.1%-95.9% for the positive, the negative, and the total samples, respectively. Among the models built, the SVM model gave the best prediction accuracy, which was clearly higher than previously reported results.
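The filter step, the per-class accuracies, and the label shuffling behind y-scrambling can be illustrated with a minimal sketch. The abstract does not give the formulas, so the two-class Fisher score below uses its common textbook form (squared difference of class means over the sum of class variances), which is an assumption about the authors' exact definition. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import random
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Two-class Fisher score of one descriptor (assumed textbook form)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    gap = (mean(pos) - mean(neg)) ** 2
    spread = pvariance(pos) + pvariance(neg)
    if spread == 0:
        # No within-class variance: perfectly separating or uninformative.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def rank_descriptors(X, labels):
    """Return descriptor (column) indices, highest Fisher score first."""
    n_cols = len(X[0])
    scores = [fisher_score([row[j] for row in X], labels) for j in range(n_cols)]
    return sorted(range(n_cols), key=lambda j: scores[j], reverse=True)


def class_accuracies(y_true, y_pred):
    """Accuracy on positive, negative, and all samples, as fractions."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    acc_pos = sum(p == 1 for p in pos) / len(pos)
    acc_neg = sum(p == 0 for p in neg) / len(neg)
    acc_all = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return acc_pos, acc_neg, acc_all


def scramble_labels(labels, seed=0):
    """Shuffled copy of the labels, as used to refit a model in y-scrambling."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    # Toy data (invented): column 0 separates the classes, column 1 does not.
    X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [3.0, 4.5], [3.1, 3.5], [2.9, 4.0]]
    y = [1, 1, 1, 0, 0, 0]
    print("descriptor ranking:", rank_descriptors(X, y))
    print("scrambled labels:", scramble_labels(y))
    print("accuracies:", class_accuracies([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))
```

In a y-scrambling check, a model refitted on the shuffled labels should score clearly worse than the model fitted on the true labels, which is the comparison the abstract reports for the SVM model.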

Key words: Acetylcholinesterase inhibitor, Machine learning method, Feature selection, Applicability domain

MSC2000: 

  • O641