Protein tertiary structure is indispensable for revealing the biological functions of proteins. De novo prediction of protein tertiary structure depends on protein fold recognition. This study proposes a novel method for predicting protein fold types that takes the primary sequence as input. The proposed method, PFP-RFSM, employs a random forest classifier and a comprehensive feature representation that includes both sequence descriptors and predicted structure descriptors. In particular, we propose a method for generating features based on sequence motifs, and these features are employed in protein fold prediction for the first time. PFP-RFSM and ten representative protein fold predictors are evaluated on a benchmark dataset consisting of 27 fold types. Experiments demonstrate that PFP-RFSM outperforms all ten compared predictors and improves the success rate by 2%-14%. The results suggest that sequence motifs are effective for the classification and analysis of protein sequences.
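As a rough illustration of what motif-based features could look like, the sketch below counts occurrences of a few motifs in a primary sequence and returns one count per motif. The abstract does not specify how PFP-RFSM derives or encodes its motifs, so the regular-expression encoding, the three example patterns, and the function name `motif_features` are all assumptions made for illustration only.

```python
import re

# Hypothetical motifs written as regular expressions (illustrative only;
# these are NOT the motifs used by PFP-RFSM).
MOTIFS = [
    r"G.{4}GK[ST]",  # P-loop-like pattern
    r"C.{2}C",       # two cysteines separated by two residues
    r"N[^P][ST]",    # N-glycosylation-like pattern
]


def motif_features(sequence, motifs=MOTIFS):
    """Return one occurrence count per motif for a protein sequence."""
    seq = sequence.upper()
    features = []
    for pattern in motifs:
        # Wrapping the pattern in a lookahead counts overlapping matches too.
        count = len(re.findall(f"(?={pattern})", seq))
        features.append(count)
    return features


if __name__ == "__main__":
    # Toy sequence, not taken from the benchmark dataset.
    print(motif_features("MGAAAAGKSCAACNGT"))
```

A vector of this kind could be concatenated with other sequence and predicted-structure descriptors and passed to any random forest implementation; that pairing is again an assumption about the general workflow rather than a description of the authors' pipeline.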