JBiSE Vol.6 No.12, December 2013
PFP-RFSM: Protein fold prediction by using random forests and sequence motifs
Abstract: Protein tertiary structure is indispensable for revealing the biological functions of proteins. De novo prediction of protein tertiary structure depends on protein fold recognition. This study proposes a novel method for predicting protein fold types that takes the primary sequence as input. The proposed method, PFP-RFSM, employs a random forest classifier and a comprehensive feature representation that includes both sequence descriptors and predicted structure descriptors. In particular, we propose a method for generating features based on sequence motifs; these features are used here for the first time in protein fold prediction. PFP-RFSM and ten representative protein fold predictors are evaluated on a benchmark dataset consisting of 27 fold types. Experiments demonstrate that PFP-RFSM outperforms all existing protein fold predictors and improves the success rates by 2%-14%. The results suggest that sequence motifs are effective in the classification and analysis of protein sequences.
Cite this paper: Li, J., Wu, J. and Chen, K. (2013) PFP-RFSM: Protein fold prediction by using random forests and sequence motifs. Journal of Biomedical Science and Engineering, 6, 1161-1170. doi: 10.4236/jbise.2013.612145.
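The abstract does not say how the motif-based features are computed. As a rough illustration only, the sketch below counts occurrences of a few motifs in each sequence and returns one count per motif. The motif patterns, the function names, and the use of regular expressions are all assumptions made for this example, not the procedure used in PFP-RFSM.

```python
import re

# Hypothetical motifs written as regular expressions. The paper's actual
# motif set and its discovery procedure are not given in the abstract.
EXAMPLE_MOTIFS = ["G.GK[ST]", "C..C", "HE..H"]


def motif_features(sequence, motifs=EXAMPLE_MOTIFS):
    """Return one count per motif: the number of non-overlapping matches."""
    return [len(re.findall(pattern, sequence)) for pattern in motifs]


def featurize(sequences, motifs=EXAMPLE_MOTIFS):
    """Build a feature matrix with one row per sequence."""
    return [motif_features(seq, motifs) for seq in sequences]


if __name__ == "__main__":
    toy_sequences = ["MAGSGKTLLCAACHEAAH", "MKVLAAGIVGL"]
    for seq, row in zip(toy_sequences, featurize(toy_sequences)):
        print(seq, row)
```

Rows built this way could be concatenated with other sequence and predicted-structure descriptors and passed to a random forest implementation, for example scikit-learn's `RandomForestClassifier`; that pairing is likewise an assumption here rather than a detail stated in the abstract.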

 
 