NS Vol.1 No.2, September 2009
Sequence-Based Protein Crystallization Propensity Prediction for Structural Genomics: Review and Comparative Analysis
Abstract: Structural genomics (SG) is an international effort that aims at solving the three-dimensional shapes of important biological macromolecules, with a primary focus on proteins. One of the main bottlenecks in SG is the production of diffraction-quality crystals for X-ray crystallography-based protein structure determination. SG pipelines allow some flexibility in target selection, which motivates the development of in silico methods for sequence-based prediction and assessment of protein crystallization propensity. We overview the existing SG databanks that are used to derive these predictive models, and we discuss analytical results concerning protein sequence properties that were found to correlate with the ability to form crystals. We also contrast and empirically compare modern sequence-based predictors of crystallization propensity, including OB-Score, ParCrys, XtalPred and CRYSTALP2. Our analysis shows that these methods provide useful and complementary predictions. Although their average accuracy is similar, at around 70%, we show that a simple majority-vote ensemble improves accuracy to almost 74%. The best improvements are achieved by combining XtalPred with CRYSTALP2, while the OB-Score and ParCrys methods overlap to a larger extent, although they still complement the other two predictors. We also demonstrate that 90% of the protein chains can be correctly predicted by at least one of these methods, which suggests that more accurate ensembles could be built in the future. We believe that current protein crystallization propensity predictors could provide useful input for the target selection procedures used by the SG centers.
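The two ensemble statistics mentioned in the abstract (majority-vote accuracy and the fraction of chains that at least one method predicts correctly) can be illustrated with a minimal Python sketch. The function names, the toy labels and the tie-breaking rule are assumptions made for illustration only; the abstract does not say how ties among the four predictors are resolved, and the numbers below are invented, not results from the paper.

```python
from typing import Dict, List, Sequence


def majority_vote(votes: Sequence[int], tie_value: int = 0) -> int:
    """Combine binary votes (1 = crystallizable, 0 = not).

    With an even number of voters a tie is possible; `tie_value` is an
    assumed tie-breaking rule, not one taken from the paper.
    """
    ones = sum(votes)
    zeros = len(votes) - ones
    if ones > zeros:
        return 1
    if zeros > ones:
        return 0
    return tie_value


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of chains whose predicted label matches the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def oracle_coverage(per_method: Dict[str, List[int]], actual: Sequence[int]) -> float:
    """Fraction of chains predicted correctly by at least one method."""
    hits = sum(
        any(preds[i] == truth for preds in per_method.values())
        for i, truth in enumerate(actual)
    )
    return hits / len(actual)


# Invented toy data: six protein chains, four predictors.
actual = [1, 1, 0, 0, 1, 0]
predictions = {
    "OB-Score":  [1, 0, 0, 1, 1, 0],
    "ParCrys":   [1, 0, 0, 1, 0, 0],
    "XtalPred":  [1, 1, 1, 0, 1, 0],
    "CRYSTALP2": [0, 1, 0, 0, 1, 1],
}

# One ensemble decision per chain, built from the four per-method votes.
ensemble = [
    majority_vote([preds[i] for preds in predictions.values()])
    for i in range(len(actual))
]

for name, preds in predictions.items():
    print(f"{name}: accuracy = {accuracy(preds, actual):.2f}")
print(f"majority vote: accuracy = {accuracy(ensemble, actual):.2f}")
print(f"at least one method correct: {oracle_coverage(predictions, actual):.2f}")
```

The gap between the majority-vote accuracy and the "at least one method correct" figure is the headroom that a better ensemble could in principle exploit, which is the point the abstract makes with its 74% and 90% values.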
Cite this paper: Kurgan, L. and Mizianty, M.J. (2009) Sequence-Based Protein Crystallization Propensity Prediction for Structural Genomics: Review and Comparative Analysis. Natural Science, 1, 93-106. doi: 10.4236/ns.2009.12012.
