ABSTRACT Gene finding, the accurate annotation of genomic DNA, has become one of the central topics in biological research. Although various computational methods (gene finders) have been proposed and developed, each has its own limitations. In this paper, we introduce an integrating gene finder, which combines the results of several existing gene finders to improve the accuracy of gene finding. Four integration schemes, all based on majority voting, are developed and evaluated on two datasets: a basic dataset of 1500 DNA sequences and a testing dataset of 103 DNA sequences. We demonstrate that a simple integration (a simple vote at each nucleotide) can significantly improve prediction performance, and that removing confusing gene finders, namely those with poor performance or redundant results, is important for further improving the integration. The best prediction results are obtained using weighted majority voting, aided by the mRMR (Minimum Redundancy Maximum Relevance) method (Peng, 2005) for gene finder selection. The prediction accuracies are 84.16% and 90.06% for the basic dataset and the testing dataset, respectively, both better than those of any individual gene finder examined in our study.
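The per-nucleotide voting idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes each gene finder's output has already been converted to a string of equal length in which '1' marks a coding nucleotide and '0' a non-coding one. The function names, the tie-breaking rule (ties go to non-coding) and the example weights are hypothetical choices made for illustration only, and the mRMR selection step is not shown.

```python
from typing import Sequence


def _check_lengths(predictions: Sequence[str]) -> int:
    """Return the common length of all prediction strings, or raise."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same sequence length")
    return length


def majority_vote(predictions: Sequence[str]) -> str:
    """Simple per-nucleotide majority vote; every gene finder counts equally.

    Assumption: '1' = coding, '0' = non-coding; ties are labeled non-coding.
    """
    length = _check_lengths(predictions)
    labels = []
    for i in range(length):
        coding_votes = sum(p[i] == "1" for p in predictions)
        labels.append("1" if 2 * coding_votes > len(predictions) else "0")
    return "".join(labels)


def weighted_majority_vote(predictions: Sequence[str],
                           weights: Sequence[float]) -> str:
    """Per-nucleotide vote in which each gene finder carries its own weight.

    Assumption: weights are non-negative numbers supplied by the caller,
    for example accuracies measured on a training set.
    """
    length = _check_lengths(predictions)
    if len(weights) != len(predictions):
        raise ValueError("one weight per gene finder is required")
    total = sum(weights)
    labels = []
    for i in range(length):
        coding_weight = sum(w for p, w in zip(predictions, weights) if p[i] == "1")
        labels.append("1" if 2 * coding_weight > total else "0")
    return "".join(labels)


# Three hypothetical gene finders labeling a four-nucleotide sequence.
finder_outputs = ["1100", "1010", "1001"]
simple = majority_vote(finder_outputs)
weighted = weighted_majority_vote(finder_outputs, [0.9, 0.3, 0.3])
```

In this toy input, the simple vote marks the second nucleotide as coding only if more than half of the finders agree, whereas the weighted vote lets a single heavily weighted finder decide it. That difference is why the choice of weights, and of which finders to include, matters.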
Cite this paper
Cai, Y., He, Z., Hu, L., Li, B., Zhou, Y., Xiao, H., Wang, Z., Feng, K., Lu, L., Feng, K. and Li, H. (2010) Gene finding by integrating gene finders. Journal of Biomedical Science and Engineering, 3, 1061-1068. doi: 10.4236/jbise.2010.311137.
Kulp, D., Haussler, D., Reese, M.G. and Eeckman, F.H. (1996) A generalized Hidden Markov Model for the recognition of human genes in DNA. Intelligent Systems for Molecular Biology, 4(2), 134-142.
Snyder, E. and Stormo, G. (1995) Identification of protein coding regions in genomic DNA. Journal of Molecular Biology, 248, 1-18.
Borodovsky, M. and McIninch, J.G. (1993) Parallel gene recognition for both DNA strands. Computational Chemistry, 17, 123-133.
Guigo, R., Knudsen, S., Drake, N. and Smith, T.F. (1992) Prediction of gene structure. Journal of Molecular Biology, 226, 141-157.
Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268, 78-94.
Salamov, A. and Solovyev, V. (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Research, 10, 516-522.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403-410.
Korf, I., Flicek, P., Duan, D. and Brent, M.R. (2001) Integrating genomic homology into gene structure prediction. Bioinformatics, 17, S140-S148.
Condorcet, N.C. (1785) Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. Imprimerie Royale, Paris.
Huang, Y.S. and Suen, C.Y. (1995) A method of combining multiple experts for the recognition of unconstrained handwritten numerals. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17, 90-94.
Lam, L., Huang, Y.S. and Suen, S.Y. (1997) Combination of multiple classifier decisions for optical character recognition. In: Bunke, H. and Wang, P.S.P., Eds., Handbook of Character Recognition and Document Image Analysis, World Scientific Publishing Company, New Jersey, 79-101.
Suen, C.Y., Nadal, C., Mai, T.A., Legault, R. and Lam, L. (1990) Recognition of totally unconstrained handwritten numerals based on the concept of multiple experts. Proceedings of IWFHR, Montreal, 131-143.
Stajniak, A., Szostakowski, J. and Skoneczny, S. (1997) Mixed neural-traditional classifier for character recognition. Proceedings of SPIE - The International Society for Optical Engineering, 2949, 102-110.
Rahman, A.F.R., Alam, H. and Fairhurst, M.C. (2002) Multiple classifier combination for character recognition: Revisiting the majority voting system and its variation. Lecture Notes in Computer Science, 2324, 167-178.
Ho, T.K., Hull, J.J. and Srihari, S.N. (1992) Combination of decisions by multiple classifiers. In: Baird, H.S., Bunke, H. and Yamamoto, K., Eds., Structured Document Image Analysis, Secaucus, Springer-Verlag Inc., New York, 188-202.
Rohlfing, T., Russakoff, D.B. and Maurer, C.R. (2004) Performance-based classifier combination in atlas-based image segmentation using expectation-maximization parameter estimation. IEEE Transactions on Medical Imaging, 23, 983-994.
Paik, J., Jung, S. and Lee, Y. (1993) Multiple combined recognition system for automatic processing of credit card slip applications. Proceedings of the 2nd International Conference on Document Analysis and Recognition, IEEE Computer Society Press, California, 520-523.
Altincay, H. and Demirekler, M. (2000) An information theoretic framework for weight estimation in the combination of probabilistic classifiers for speaker identification. Speech Communication, 30, 255-272.
Lam, L. and Suen, C.Y. (1997) Application of majority voting to pattern recognition: An analysis of its behavior and performance. IEEE Transactions on Pattern Analysis, 27, 553-568.
Lam, L. and Suen, C.Y. (1997) A theoretical-analysis of the application of majority voting to pattern-recognition. Jerusalem, Israel, 418-420.
Rahman, A.F.R. and Fairhurst, M.C. (1997) Exploiting second order information to design a novel multiple expert decision combination platform for pattern classification. Electronics Letters, 33, 476-477.
Peng, H.C., Long, F.H. and Ding, C. (2005) Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 1226-1238.
Stanke, M. and Waack, S. (2003) Gene prediction with a hidden-Markov model and a new intron submodel. Bioinformatics, 19(Suppl. 2), ii215-ii225.
Krogh, A. (1997) Two methods for improving performance of a HMM and their application for gene finding. In: Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C. and Valencia, A., Eds., Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, 179-186.
Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., et al. (2007) ClustalW2 and ClustalX version 2. Bioinformatics, 23, 2947-2948.
Burset, M. and Guigo, R. (1996) Evaluation of gene structure prediction programs. Genomics, 34, 353-367.