ABSTRACT Predicted relative solvent accessibility (RSA) provides useful information for the prediction of binding sites and for reconstruction of the 3D structure from a protein sequence. Recent years have seen the development of several RSA prediction methods, including those that generate real values and those that predict discrete states (buried vs. exposed). We propose a novel method for real-value prediction that aims to reduce the prediction error relative to six existing methods. The proposed method is based on a two-stage Support Vector Regression (SVR) predictor. The improved prediction quality results from the developed composite sequence representation, which includes a custom-selected subset of features from the PSI-BLAST profile, secondary structure predicted with PSI-PRED, and a binary code that indicates the position of a given residue with respect to the sequence termini. Cross-validation tests on a benchmark dataset show that our method achieves a mean absolute error of 14.3 and a correlation of 0.68. We also propose a confidence value that is associated with each predicted RSA value. The confidence is computed from the difference between the predictions of the two-stage SVR and those of a second, two-stage Linear Regression (LR) predictor. The confidence values can be used to indicate the quality of the output RSA predictions.
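The two evaluation measures named in the abstract (mean absolute error and correlation) and the idea of a disagreement-based confidence can be illustrated with a minimal sketch. The abstract does not give the confidence formula, so the `confidence` function below, its linear form, and its `scale` parameter are assumptions made purely for illustration; the correlation is assumed to be the Pearson coefficient, and the sample RSA values are made up.

```python
from math import sqrt


def mean_absolute_error(actual, predicted):
    """Mean absolute difference between actual and predicted RSA values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pearson_correlation(actual, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_a * var_p)


def confidence(svr_rsa, lr_rsa, scale=100.0):
    """Hypothetical confidence score (not the paper's formula).

    Returns 1.0 when the two predictors agree exactly and decreases
    linearly to 0.0 as their absolute difference approaches `scale`.
    """
    diff = abs(svr_rsa - lr_rsa)
    return max(0.0, 1.0 - diff / scale)


# Made-up RSA values (in percent) for four residues.
actual_rsa = [10.0, 50.0, 80.0, 20.0]
svr_rsa = [15.0, 40.0, 70.0, 30.0]   # stand-in for two-stage SVR output
lr_rsa = [20.0, 45.0, 60.0, 25.0]    # stand-in for two-stage LR output

mae = mean_absolute_error(actual_rsa, svr_rsa)
corr = pearson_correlation(actual_rsa, svr_rsa)
conf = [confidence(s, l) for s, l in zip(svr_rsa, lr_rsa)]
print(mae, corr, conf)
```

Under this assumed scoring, residues on which the SVR and LR predictors disagree strongly receive a low confidence, which matches the abstract's stated use of the confidence value as an indicator of prediction quality.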
Cite this paper
Chen, K., Kurgan, M. and Kurgan, L. (2008) Sequence based prediction of relative solvent accessibility using two-stage support vector regression with confidence values. Journal of Biomedical Science and Engineering, 1, 1-9. doi: 10.4236/jbise.2008.11001.