ABSTRACT Transcription Terminators (TTs) play an important role in bacterial RNA transcription. Some bacteria are known to have Species-Specific Subsequences (SSS) in their TTs, which are believed to contain useful clues to bacterial evolution. The SSS can be identified using biological methods which, however, tend to be costly and time-consuming because of the vast number of candidate subsequences that must be tested. In this paper, we study the problem from a computational perspective and propose a computational method to identify the SSS. Given DNA sequences of a target species, some of which are known to contain a TT while others are not, our method uses machine learning techniques and proceeds in three steps. First, we find all frequent subsequences in the given sequences, and show that this can be done efficiently using generalized suffix trees. Second, we use these subsequences as features to characterize the original DNA sequences and train a classification model using Support Vector Machines (SVM), one of the most effective machine learning techniques currently available. Third, using the parameters of the resulting SVM model, we define a measure called subsequence specificity to rank the frequent subsequences, and output the one with the highest rank as the SSS. Our experiments show that the SSS found by the proposed method are very close to those determined by biological experiments. This suggests that our method, though purely computational, can help locate the SSS efficiently by effectively narrowing down the search space.
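The three steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: frequent subsequences are found by brute-force enumeration of contiguous substrings instead of a generalized suffix tree, the SVM is replaced by a small hand-written subgradient trainer for a linear hinge-loss model, and the learned feature weight stands in for the subsequence specificity measure, whose exact definition the abstract does not give. All function names and parameters (`frequent_substrings`, `featurize`, `train_linear_svm`, `rank_subsequences`, `min_support`, `lam`, `eta`) are hypothetical.

```python
from collections import Counter


def frequent_substrings(seqs, min_support, min_len=3, max_len=5):
    """Contiguous substrings that occur in at least `min_support` sequences.

    Assumption: brute-force enumeration replaces the generalized suffix tree.
    """
    counts = Counter()
    for s in seqs:
        seen = set()  # count each substring at most once per sequence
        for k in range(min_len, max_len + 1):
            for i in range(len(s) - k + 1):
                seen.add(s[i:i + k])
        counts.update(seen)
    return sorted(sub for sub, c in counts.items() if c >= min_support)


def featurize(seq, features):
    """Binary presence vector: 1.0 if the feature occurs in `seq`, else 0.0."""
    return [1.0 if f in seq else 0.0 for f in features]


def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100):
    """Fixed-step subgradient descent on the L2-regularised hinge loss.

    Assumption: a toy stand-in for a real SVM solver; labels are +1 or -1.
    """
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1.0 - eta * lam) for wj in w]  # regulariser shrinkage
            if margin < 1.0:  # hinge loss is active for this example
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def rank_subsequences(seqs, labels, min_support):
    """Rank frequent substrings by model weight (a proxy, not the paper's measure).

    Ties are broken by preferring longer substrings, then alphabetical order.
    """
    features = frequent_substrings(seqs, min_support)
    X = [featurize(s, features) for s in seqs]
    w, _ = train_linear_svm(X, labels)
    return sorted(zip(features, w), key=lambda fw: (-fw[1], -len(fw[0]), fw[0]))
```

Calling `rank_subsequences(sequences, labels, min_support)` with labels of +1 for sequences known to contain a TT and -1 otherwise returns `(substring, weight)` pairs, highest weight first. A substring that occurs in every positive sequence and in no negative one receives a weight at least as large as any other feature in this sketch, which mirrors the intuition behind ranking candidates by their contribution to the classifier.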
Cite this paper
Gu, B. and Sun, Y. (2009) Identifying species-specific subsequences in bacteria transcription terminators-A machine learning approach. Journal of Biomedical Science and Engineering, 2, 184-189. doi: 10.4236/jbise.2009.23031.