Acta Phys. -Chim. Sin.  2013, Vol. 29 Issue (03): 498-507    DOI: 10.3866/PKU.WHXB201301042
THEORETICAL AND COMPUTATIONAL CHEMISTRY     
Feature Selection for High-Dimensional Data Based on Ridge Regression and SVM and Its Application in Peptide QSAR Modeling
WANG Zhi-Ming1,2,3, HAN Na1,2, YUAN Zhe-Ming1,2, WU Zhao-Hua3
1 Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, P. R. China;
2 Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha, 410128, P. R. China;
3 College of Science, Hunan Agricultural University, Changsha 410128, P. R. China

Abstract  

The absolute values of the weights that ridge regression (RR) estimates from training data can reflect the significance of the corresponding features. Based on RR and the support vector machine (SVM), a new feature selection algorithm for high-dimensional data is proposed. It is demonstrated on two peptide data sets: bitter tasting thresholds (BTT) and cytotoxic T lymphocyte (CTL) epitopes. Each residue of a peptide was represented by 531 physicochemical property parameters, giving 1062 and 4779 descriptors for BTT and CTL, respectively. Each data set was divided into a training set and a test set, and weight estimates for all descriptors were generated by RR on the training set. Features were then added one at a time in descending order of their weights until the mean square error (MSE) of leave-one-out cross validation (LOOCV) increased significantly. The procedure was repeated on the smaller training data set obtained from the previous step, so the retained features are the result of multiple elimination rounds. The new method selected 7 and 18 descriptors for BTT and CTL, respectively. A quantitative structure-activity relationship (QSAR) model based on support vector regression (SVR) was built on the extracted data with the retained descriptors and then used to predict the test data. The fitting, LOOCV, and external prediction accuracies were significantly improved relative to values reported in the literature. Because it is fast to compute, has clear physicochemical meaning, and is easy to interpret, the new method is widely applicable to regression forecasting of high-dimensional data such as QSAR modeling of peptides or proteins.
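
The ranking-and-selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the ridge penalty `lam`, and the relative stopping tolerance `tol` are all hypothetical; a ridge model stands in for the SVM when scoring LOOCV so that the sketch needs only NumPy; and only a single selection round is shown, not the multiple elimination rounds. The descriptor counts are consistent with 2-residue (BTT) and 9-residue (CTL) peptides, since 531 × 2 = 1062 and 531 × 9 = 4779.

```python
import numpy as np

def descriptor_count(n_residues, n_props=531):
    """Descriptors per peptide: one set of property values per residue."""
    return n_residues * n_props

def ridge_weights(X, y, lam=1.0):
    """Closed-form ridge weights on centered data (columns assumed on comparable scales)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

def loocv_mse(X, y, lam=1.0):
    """Leave-one-out MSE of a ridge model (assumption: stand-in for the SVM scorer)."""
    n = len(y)
    errors = []
    for i in range(n):
        keep = np.arange(n) != i
        X_tr, y_tr = X[keep], y[keep]
        w = ridge_weights(X_tr, y_tr, lam)
        pred = (X[i] - X_tr.mean(axis=0)) @ w + y_tr.mean()
        errors.append((y[i] - pred) ** 2)
    return float(np.mean(errors))

def select_features(X, y, lam=1.0, tol=0.05, max_features=None):
    """Add features in descending |weight| order until LOOCV MSE rises by more than tol."""
    order = np.argsort(-np.abs(ridge_weights(X, y, lam)))
    selected, best = [], np.inf
    for j in order[:max_features]:
        trial = selected + [int(j)]
        mse = loocv_mse(X[:, trial], y, lam)
        if mse > best * (1.0 + tol):  # hypothetical reading of "increased significantly"
            break
        selected, best = trial, min(best, mse)
    return selected, best

if __name__ == "__main__":
    # Synthetic data only; these are not the BTT or CTL data sets.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 100))
    y = 3.0 * X[:, 3] - 2.0 * X[:, 17] + 0.1 * rng.normal(size=60)
    features, mse = select_features(X, y, max_features=20)
    print("selected features:", features, "LOOCV MSE:", mse)
```

In a fuller version, the retained columns would be passed to an SVR model fitted on the training set and evaluated on the held-out test set, and the selection would be repeated on the reduced data.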



Key words: Quantitative structure-activity relationship; Support vector machine; Ridge regression; Feature selection; High-dimensional feature
Received: 24 September 2012      Published: 04 January 2013
MSC2000:  O641  
Fund:  

The project was supported by the Science Foundation for Distinguished Young Scholars of Hunan Province, China (10JJ1005) and Specialized Research Fund for the Doctoral Program of Higher Education, China (20124320110002).

Cite this article:

WANG Zhi-Ming, HAN Na, YUAN Zhe-Ming, WU Zhao-Hua. Feature Selection for High-Dimensional Data Based on Ridge Regression and SVM and Its Application in Peptide QSAR Modeling. Acta Phys. -Chim. Sin., 2013, 29(03): 498-507.

URL:

http://www.whxb.pku.edu.cn/10.3866/PKU.WHXB201301042     OR     http://www.whxb.pku.edu.cn/Y2013/V29/I03/498

