Health, Vol. 1, No. 1, June 2009
Incorporating heterogeneous biological data sources in clustering gene expression data
Abstract: In this paper, a similarity measure between genes based on protein-protein interactions is proposed. The ChIP-chip data are converted into the same form as gene expression data, with Pearson correlation as the similarity measure. On the basis of the similarity measures for the protein-protein interaction data and the ChIP-chip data, a combined dissimilarity measure is defined. This combined measure is introduced into the K-means method, yielding an improved K-means method. The improved K-means method and three other clustering methods are evaluated on a real dataset. Performance is assessed by a prediction accuracy analysis against known gene annotations. Our results show that the improved K-means method outperforms the other clustering methods. The performance of the improved K-means method is also tested by varying the tuning coefficients of the combined dissimilarity measure. The results show that incorporating heterogeneous data sources is helpful when clustering gene expression data, and that the coefficients for genome-wide or complete data sources should be given larger values when constructing the combined dissimilarity measure.
Cite this paper: Li, G. and Wang, Z. (2009) Incorporating heterogeneous biological data sources in clustering gene expression data. Health, 1, 17-23. doi: 10.4236/health.2009.11004.
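
The idea of a combined dissimilarity measure with tuning coefficients can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the combination is taken to be a weighted average, the protein-protein interaction dissimilarity is a hypothetical 0/1 indicator, and the clustering step is a k-medoids-style stand-in (with a deterministic farthest-first start) rather than the authors' improved K-means. All function names and the toy data are hypothetical.

```python
import math

def pearson_dissimilarity(x, y):
    """Return 1 - Pearson correlation of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # constant profile: treated as uncorrelated (assumption)
    return 1.0 - cov / (norm_x * norm_y)

def profile_matrix(profiles):
    """Pairwise Pearson dissimilarity for expression or ChIP-chip profiles."""
    n = len(profiles)
    return [[0.0 if i == j else pearson_dissimilarity(profiles[i], profiles[j])
             for j in range(n)] for i in range(n)]

def ppi_matrix(n_genes, interactions):
    """Hypothetical PPI dissimilarity: 0 for interacting pairs, 1 otherwise."""
    d = [[0.0 if i == j else 1.0 for j in range(n_genes)]
         for i in range(n_genes)]
    for i, j in interactions:
        d[i][j] = d[j][i] = 0.0
    return d

def combine(matrices, weights):
    """Weighted average of dissimilarity matrices (assumed form)."""
    total = sum(weights)
    n = len(matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices)) / total
             for j in range(n)] for i in range(n)]

def cluster(d, k, n_iter=50):
    """K-medoids-style clustering on a precomputed dissimilarity matrix."""
    n = len(d)
    medoids = [0]  # farthest-first initialisation (assumption, deterministic)
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(max(candidates,
                           key=lambda i: min(d[i][m] for m in medoids)))

    def assign():
        # Each gene joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: d[i][medoids[c]])
                for i in range(n)]

    for _ in range(n_iter):
        labels = assign()
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:
                # New medoid: member with the smallest total dissimilarity.
                updated.append(min(members,
                                   key=lambda m: sum(d[m][o] for o in members)))
            else:
                updated.append(medoids[c])  # keep medoid of an empty cluster
        if updated == medoids:
            break
        medoids = updated
    return assign()

# Toy data (hypothetical): four genes, two obvious groups.
expression = [[1, 2, 3, 4], [1.1, 2.0, 3.2, 3.9],
              [4, 3, 2, 1], [3.8, 3.1, 2.2, 0.9]]
chip_chip = [[0.9, 0.1, 0.8], [0.8, 0.2, 0.9],
             [0.1, 0.9, 0.2], [0.2, 0.8, 0.1]]
interactions = [(0, 1), (2, 3)]

# The expression data get the largest coefficient, following the abstract's
# advice for genome-wide sources; the values themselves are arbitrary.
combined = combine(
    [profile_matrix(expression), ppi_matrix(4, interactions),
     profile_matrix(chip_chip)],
    weights=[2, 1, 1],
)
labels = cluster(combined, k=2)
print(labels)
```

On this toy input, genes 0 and 1 fall into one cluster and genes 2 and 3 into the other. Changing the `weights` argument is the sketch's analogue of varying the tuning coefficients.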
