ABSTRACT Knowledge of subnuclear localization in eukaryotic cells is indispensable for understanding the biological function of the nucleus and genome regulation, and for drug discovery. In this study, a new feature representation is proposed that combines the position-specific scoring matrix (PSSM) with auto covariance (AC). The AC variables describe the neighboring effect between two amino acids and thus incorporate sequence-order information, while the PSSM describes the evolutionary information of proteins. Based on this new descriptor, a support vector machine (SVM) classifier was built to predict subnuclear localization. To evaluate the power of the predictor, a benchmark dataset containing 714 proteins localized in nine subnuclear compartments was used. The overall jackknife cross-validation accuracy of our method is 76.5%, which is higher than that of Nuc-PLoc (67.4%), OET-KNN (55.6%), AAC-based SVM (48.9%), and ProtLoc (36.6%). The prediction software used in this article and the details of the SVM parameters are freely available at http://chemlab.scu.edu.cn/predict_SubNL/index.htm. The dataset used in our study is from Shen and Chou's work and can be downloaded at http://chou.med.harvard.edu/bioinf/Nuc-PLoc/Data.htm.
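As a rough illustration of how AC variables can be derived from a PSSM, the sketch below applies the commonly used auto covariance definition to each PSSM column: for a given lag, it averages the product of mean-centered scores at positions i and i + lag. The function name `auto_covariance`, the toy matrix, and the chosen lag range are illustrative assumptions; the abstract does not state the exact normalization or maximum lag used by the authors.

```python
def auto_covariance(pssm, max_lag):
    """Auto covariance features of a PSSM (hypothetical helper, not the authors' code).

    pssm    : list of rows, one row per residue, one column per amino acid type
    max_lag : largest sequence distance (lag) between the two residues compared
    Returns a flat list of length max_lag * number_of_columns, ordered by lag
    and then by column.
    """
    n = len(pssm)
    if n == 0:
        raise ValueError("PSSM must contain at least one row")
    if not 0 < max_lag < n:
        raise ValueError("max_lag must be between 1 and sequence length - 1")
    n_cols = len(pssm[0])

    # Mean score of each column over the whole sequence.
    means = [sum(row[j] for row in pssm) / n for j in range(n_cols)]

    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            # Average product of centered scores that are `lag` positions apart.
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(n - lag)
            )
            features.append(total / (n - lag))
    return features


# Toy 4-residue, 2-column matrix (a real PSSM has 20 columns).
toy_pssm = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0]]
features = auto_covariance(toy_pssm, max_lag=2)
```

Because the output length depends only on the maximum lag and the number of columns, proteins of different lengths map to vectors of equal size, which is what a classifier such as an SVM requires as input.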
Cite this paper
Xiao, R., Guo, Y., Zeng, Y., Tan, H., Tan, H., Pu, X. and Li, M. (2009) Using position specific scoring matrix and auto covariance to predict protein subnuclear localization. Journal of Biomedical Science and Engineering, 2, 51-56. doi: 10.4236/jbise.2009.21009.