ABSTRACT Background: DNA methylation influences gene expression patterns and thereby alters gene function. Computational analysis of the methylation status of nucleotides can help to uncover the underlying causes of methylation. Results: We present a DNA sequence-based method, named the word composition encoding method, for analyzing the methylation status of CpG dinucleotides using 5-bp (5-mer) DNA fragments. The prediction accuracy is 75.16% when all 5-bp word compositions are used (4^5 = 1024 in total). Furthermore, the 5-bp DNA fragments (words) with the greatest impact on methylation status are identified by the mRMR (Maximum-Relevance-Minimum-Redundancy) feature selection method. As a result, 58 words are selected and used to build a compact predictor, which achieves 77.45% prediction accuracy. Coupling the word composition encoding method with the feature selection strategy allows the meaning of these words to be analyzed through their contribution to the prediction. Biological evidence in the literature supports the view that the DNA sequence surrounding a CpG dinucleotide affects its methylation. Conclusions: The main contribution of this paper is to identify and analyze the key DNA words, taken from the neighborhood of CpG dinucleotides, that induce DNA methylation.
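The word composition encoding described in the abstract can be illustrated with a minimal Python sketch. It maps a DNA sequence to a 1024-dimensional vector of 5-mer frequencies. The names `WORDS`, `WORD_INDEX`, and `word_composition` are hypothetical, and the choices to use overlapping windows, to normalize counts to frequencies, and to skip windows containing non-ACGT characters are assumptions made for illustration; the abstract does not specify them.

```python
from itertools import product

ALPHABET = "ACGT"
K = 5  # word length used in the abstract (5-bp words)

# All 4^5 = 1024 possible 5-bp words, in a fixed lexicographic order.
WORDS = ["".join(p) for p in product(ALPHABET, repeat=K)]
WORD_INDEX = {word: i for i, word in enumerate(WORDS)}


def word_composition(seq):
    """Return the 1024-dimensional 5-mer frequency vector of a DNA sequence.

    Assumption: overlapping windows are counted and the counts are
    normalized to sum to 1; windows with ambiguous bases are skipped.
    """
    seq = seq.upper()
    counts = [0] * len(WORDS)
    total = 0
    for start in range(len(seq) - K + 1):
        idx = WORD_INDEX.get(seq[start:start + K])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(WORDS)
    return [c / total for c in counts]


# Usage: encode a short (made-up) sequence around a CpG site.
vec = word_composition("ACGTACGTAC")
```

A vector of this kind could then be passed to any standard classifier, and a feature selection step such as mRMR would rank the 1024 dimensions to pick a compact subset of words.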
Cite this paper
Lu, L., Lin, K., Qian, Z., Li, H., Cai, Y. and Li, Y. (2010) Predicting DNA methylation status using word composition. Journal of Biomedical Science and Engineering, 3, 672-676. doi: 10.4236/jbise.2010.37091.