Acta Physico-Chimica Sinica, 2009, Vol. 25, Issue (12): 2558-2564. doi: 10.3866/PKU.WHXB20091122

Research Article

Classification Modeling and Recognition of Protein Fold Type

LIU Yue, LI Xiao-Qin, XU Hai-Song, QIAO Hui

  1. School of Life Science and Bioengineering, Beijing University of Technology, Beijing 100124, P. R. China
  • Received: 2009-05-07 Revised: 2009-08-28 Published: 2009-11-27
  • Contact: LI Xiao-Qin E-mail: lxq0811@bjut.edu.cn

Abstract:

How a protein's amino acid sequence determines its three-dimensional structure is one of the core questions in the life sciences. A protein's fold type reflects the topological pattern of its structural core, and fold recognition is an important part of protein sequence-structure research. This article focuses on 36 fold types that cannot each be described by a single unified model and that together account for 41.8% of the α, β, and α/β proteins in the Astral 1.65 sequence database. For each fold type, samples with less than 25% sequence identity to one another were taken as the training set and grouped by hierarchical clustering, using root mean square deviation (RMSD) as the measure, to generate several fold subgroups. A profile hidden Markov model (profile-HMM) was then built for each subgroup from a structural alignment produced by the multiple structural alignment algorithm (MUSTANG). On a test set of 9505 Astral 1.65 proteins with less than 95% sequence identity, the average sensitivity, specificity, and Matthews correlation coefficient (MCC) over the 36 fold types were 90%, 99%, and 0.95, respectively. These results show that, for fold types with too many members to be captured by a unified model, RMSD-based hierarchical classification modeling achieves high recognition accuracy. The approach offers a new method and new ideas for protein fold recognition and lays a foundation for further research.

Key words: Protein fold type, RMSD, Hierarchical clustering, Profile-HMM, Fold recognition
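
The subgrouping step described in the abstract (hierarchical clustering of the members of one fold type by pairwise RMSD) can be illustrated with a minimal sketch. The abstract does not state the linkage rule or the cutoff, so average linkage, the function name `cluster_by_rmsd`, the `cutoff` parameter, and the toy RMSD values below are all assumptions made for illustration, not the authors' settings.

```python
from itertools import combinations


def cluster_by_rmsd(rmsd, cutoff):
    """Average-linkage agglomerative clustering of a symmetric RMSD matrix.

    Hypothetical helper: merging stops once the closest pair of clusters
    is farther apart than `cutoff`. Returns clusters as sorted index lists.
    """
    clusters = [[i] for i in range(len(rmsd))]

    def avg_dist(a, b):
        # Mean pairwise RMSD between the members of two clusters.
        return sum(rmsd[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        (x, y), d = min(
            (((p, q), avg_dist(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)


# Toy pairwise RMSD values in angstroms (made up, not from the paper).
toy_rmsd = [
    [0.0, 1.2, 1.5, 6.0, 6.3],
    [1.2, 0.0, 1.1, 5.8, 6.1],
    [1.5, 1.1, 0.0, 6.2, 5.9],
    [6.0, 5.8, 6.2, 0.0, 1.4],
    [6.3, 6.1, 5.9, 1.4, 0.0],
]
subgroups = cluster_by_rmsd(toy_rmsd, cutoff=3.0)
print(subgroups)  # [[0, 1, 2], [3, 4]]
```

In the workflow the abstract outlines, each resulting subgroup would then be structurally aligned and used to build its own profile-HMM; those steps rely on external tools and are not reproduced here.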

CLC Number:

  • O641
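
The three evaluation measures reported in the abstract (sensitivity, specificity, and MCC) can be computed from a confusion matrix for one fold type, as in the sketch below. It assumes the common definitions sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP); the abstract does not say which definition of specificity was used, and some recognition studies define it as TP/(TP+FP) instead. The function name `fold_metrics` and the counts are hypothetical.

```python
import math


def fold_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) for one fold type.

    Assumes sensitivity = TP/(TP+FN) and specificity = TN/(TN+FP);
    the source abstract does not spell out its definitions.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts for a single fold type (not data from the paper).
sens, spec, mcc = fold_metrics(tp=90, fp=10, tn=990, fn=10)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} MCC={mcc:.2f}")
```

Averaging these per-fold values over all fold types would give summary figures of the kind quoted in the abstract.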