JBiSE Vol.11 No.6, June 2018
Improving Protein Sequence Classification Performance Using Adjacent and Overlapped Segments on Existing Protein Descriptors
In protein sequence classification research, a variable-length protein sequence is commonly converted into a fixed-length numerical vector using descriptors such as k-mer composition. Such position-independent descriptors are useful because they apply to sequences of any length; however, they discard the positional information of subsequences, even though that information might contribute substantially to classification performance. To address this problem, we divided the original sequence into several segments and calculated the numerical features for each of them. This partially introduces positional information (for instance, the compositions of serine in the anterior and posterior segments of a sequence). Through comprehensive experiments on the number of segments and the length of the overlapping region, we found that our classification approach, which combines sequence segmentation with feature selection, is effective in improving performance. We evaluated our approach on three protein classification problems and achieved significant improvement in all cases where the dataset contains a sufficient number of amino acids in each sequence. This result shows the great potential of using additional segments, demonstrated here for protein sequence classification, for solving other sequence problems in bioinformatics.
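
The segmentation idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how segment boundaries or overlaps are computed, so the equal-length split, the symmetric extension of each segment by `overlap` residues, and all function names below are assumptions made for illustration. Amino acid composition (1-mer composition) stands in for any position-independent descriptor.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector has the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def split_into_segments(sequence, n_segments, overlap=0):
    """Split a sequence into n_segments adjacent pieces of near-equal length.

    Each piece is extended by `overlap` residues into its neighbours
    (assumed scheme; the source does not define the exact boundaries).
    """
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    length = len(sequence)
    segments = []
    for i in range(n_segments):
        # Integer arithmetic keeps the adjacent pieces contiguous and gap-free.
        start = i * length // n_segments
        end = (i + 1) * length // n_segments
        # Widen the piece on both sides, clipped to the sequence ends.
        start = max(0, start - overlap)
        end = min(length, end + overlap)
        segments.append(sequence[start:end])
    return segments


def composition(segment):
    """Return the fraction of each standard amino acid in the segment."""
    counts = Counter(segment)
    total = len(segment)
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


def segmented_features(sequence, n_segments, overlap=0):
    """Concatenate whole-sequence composition with per-segment compositions."""
    features = composition(sequence)
    for segment in split_into_segments(sequence, n_segments, overlap):
        features.extend(composition(segment))
    return features


if __name__ == "__main__":
    # Toy sequence: serine only in the anterior half, alanine only in the posterior half.
    vector = segmented_features("SSSSAAAA", n_segments=2)
    serine = AMINO_ACIDS.index("S")
    print(len(vector))           # 20 values for the whole sequence plus 20 per segment
    print(vector[serine])        # serine fraction over the whole sequence
    print(vector[20 + serine])   # serine fraction in the anterior segment
    print(vector[40 + serine])   # serine fraction in the posterior segment
```

In this toy case the whole-sequence composition alone cannot tell where the serines occur, whereas the per-segment values differ between the anterior and posterior halves. The feature vector grows linearly with the number of segments, which is one reason a feature selection step, as mentioned in the abstract, is a natural companion to segmentation.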

Cite this paper
Faisal, M. , Abapihi, B. , Nguyen, N. , Purnama, B. , Delimayanti, M. , Phan, D. , Lumbanraja, F. , Kubo, M. and Satou, K. (2018) Improving Protein Sequence Classification Performance Using Adjacent and Overlapped Segments on Existing Protein Descriptors. Journal of Biomedical Science and Engineering, 11, 126-143. doi: 10.4236/jbise.2018.116012.