JBiSE Vol.3 No.7, July 2010
A novel voting system for the identification of eukaryotic genome promoters
ABSTRACT
Motivation: Accurate identification and delineation of promoters and transcription start sites (TSSs) is important for improving genome annotation and for designing experiments to study transcriptional regulation. Many promoter identifiers have been developed, but each has its own focus and limitations. We introduce an integration scheme that combines several identifiers to achieve better prediction performance. Result: Eight promoter identifiers (Proscan, TSSG, TSSW, FirstEF, eponine, ProSOM, EP3, FPROM) are chosen for the investigation of integration. A feature selection method, mRMR (Minimum Redundancy Maximum Relevance), is adapted to promoter identifier selection in order to choose a group of robust and complementary identifiers. For comparison, four integration methods of increasing complexity (SMV, WMV, SMV_IS, WMV_IS) are developed and applied to a training dataset of 1400 sequences and a testing dataset of 378 sequences. mRMR selects 5 identifiers (FPROM, FirstEF, TSSG, eponine, TSSW), and their integration achieves correct prediction rates of 70.08% on the training dataset and 67.83% on the testing dataset. This is better than any single identifier: the best single one achieves only 59.32% and 61.78% on the training and testing datasets, respectively.
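To illustrate how several identifiers can be combined by voting, the following is a minimal sketch and not the authors' implementation. It assumes that SMV and WMV denote simple and weighted majority voting, which the abstract does not spell out, and that each identifier emits a binary call (1 for promoter, 0 for non-promoter) on a sequence. The function names, the example votes, and the integer weights are hypothetical and are not taken from the paper.

```python
from typing import Dict


def simple_majority_vote(votes: Dict[str, int]) -> int:
    """Return 1 if strictly more than half of the identifiers vote 1."""
    positives = sum(votes.values())
    return 1 if positives * 2 > len(votes) else 0


def weighted_majority_vote(votes: Dict[str, int], weights: Dict[str, int]) -> int:
    """Return 1 if the positive votes carry more than half of the total weight."""
    total = sum(weights[name] for name in votes)
    positive = sum(weights[name] for name, vote in votes.items() if vote == 1)
    return 1 if positive * 2 > total else 0


# Hypothetical calls from the five selected identifiers on one sequence.
votes = {"FPROM": 1, "FirstEF": 1, "TSSG": 0, "eponine": 0, "TSSW": 0}

# Hypothetical weights (illustrative only); in practice they might be
# derived from each identifier's accuracy on the training dataset.
weights = {"FPROM": 3, "FirstEF": 3, "TSSG": 1, "eponine": 1, "TSSW": 1}

print(simple_majority_vote(votes))             # 2 of 5 votes are positive
print(weighted_majority_vote(votes, weights))  # positive weight 6 of total 9
```

With these made-up inputs the two schemes disagree: the unweighted vote rejects the sequence, while the weighted vote accepts it because the two positive identifiers carry most of the weight. This is the kind of difference a comparison between unweighted and weighted voting would probe.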

Cite this paper
Lei, L., Feng, K., He, Z. and Cai, Y. (2010) A novel voting system for the identification of eukaryotic genome promoters. Journal of Biomedical Science and Engineering, 3, 719-726. doi: 10.4236/jbise.2010.37096.