ABSTRACT Although gene structure annotation has been studied extensively, predictive techniques are still not fully developed. In this paper, based on the base composition of sequences and the conservation of nucleotides at exon/intron splice sites, a least increment of diversity algorithm (LIDA) is developed for studying and predicting three kinds of sequences: coding exons, introns and intergenic regions. First, using the compositions of the 64 trinucleotides and 120 position parameters of the four bases as informational parameters, coding exon, intron and intergenic sequences are predicted. The overall prediction accuracies are 91.1% and 88.4% for the A. thaliana and C. elegans genomes, respectively. Subsequently, based on the position frequencies of the four bases in regions near intron/coding exon boundaries and near the translation initiation and termination sites, 12 position parameters are selected as diversity sources, and three kinds of coding exons are predicted with the LIDA. The prediction success rates are higher than 80%. These results can be used in sequence annotation.
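The abstract does not give the LIDA formulas, so the following is only a minimal sketch of the general idea, assuming the standard measure of diversity from the diversity-measure literature, D(X) = N log N - sum_i n_i log n_i, and the increment of diversity ID(X, Y) = D(X + Y) - D(X) - D(Y). The function names, the use of overlapping trinucleotides, and the toy sequences are illustrative assumptions, and the 120 position parameters mentioned above are not modeled.

```python
import math
from collections import Counter
from itertools import product

# The 64 trinucleotides in a fixed order (AAA, AAC, ..., TTT).
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_counts(seq):
    """Count overlapping trinucleotides (an assumption; the paper may
    count them differently). Triplets with other letters are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    return [counts[t] for t in TRINUCLEOTIDES]


def diversity(counts):
    """Measure of diversity D = N log N - sum n_i log n_i (0 log 0 = 0)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); small when X and Y are similar."""
    mixed = [a + b for a, b in zip(x, y)]
    return diversity(mixed) - diversity(x) - diversity(y)


def predict(seq, class_sources):
    """Assign seq to the class whose source gives the least increment."""
    query = trinucleotide_counts(seq)
    return min(
        class_sources,
        key=lambda name: increment_of_diversity(query, class_sources[name]),
    )


# Toy diversity sources built from made-up sequences, not real training data.
sources = {
    "exon": trinucleotide_counts("ATGATGATGATG"),
    "intron": trinucleotide_counts("TTTTTTTTTT"),
}
label = predict("ATGATGATG", sources)
```

In this sketch each class is represented by one pooled count vector (its diversity source), and a query sequence is assigned to the class for which mixing in the query raises the diversity the least.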
Cite this paper
Lin, H., Li, Q. and Chen, C. (2009) Analysis and prediction of exon, intron, intergenic region and splice sites for A. thaliana and C. elegans genomes. Journal of Biomedical Science and Engineering, 2, 367-373. doi: 10.4236/jbise.2009.26053.