Health, Vol. 1, No. 1, June 2009
Incorporating heterogeneous biological data sources in clustering gene expression data
In this paper, a similarity measure between genes based on protein-protein interactions is proposed. The ChIP-chip data are converted into the same form as gene expression data, with Pearson correlation as the similarity measure. On the basis of the similarity measures for the protein-protein interaction data and the ChIP-chip data, a combined dissimilarity measure is defined. This combined dissimilarity measure is introduced into the K-means method, yielding what can be considered an improved K-means method. The improved K-means method and three other clustering methods are evaluated on a real dataset. The performance of these methods is assessed by a prediction accuracy analysis using known gene annotations. Our results show that the improved K-means method outperforms the other clustering methods. The performance of the improved K-means method is also tested by varying the tuning coefficients of the combined dissimilarity measure. The results show that incorporating heterogeneous data sources is helpful when clustering gene expression data, and that the coefficients for genome-wide or complete data sources should be given larger values when constructing the combined dissimilarity measure.
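The abstract does not give the exact form of the combined dissimilarity measure, so the following is only a minimal sketch under stated assumptions: the combination is taken to be a weighted sum, expression and ChIP-chip dissimilarities are taken as one minus the Pearson correlation, and the protein-protein interaction dissimilarity is taken as one minus a binary adjacency matrix. The function names, the `weights` parameter, and its default values are hypothetical and are not drawn from the paper.

```python
import numpy as np


def pearson_dissimilarity(profiles):
    """Return 1 - Pearson correlation between the rows (genes) of `profiles`."""
    return 1.0 - np.corrcoef(profiles)


def combined_dissimilarity(expr, chip, ppi_adj, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of three gene-by-gene dissimilarity matrices.

    Assumptions (not taken from the paper): a linear combination, and
    a PPI dissimilarity of 1 - adjacency with a zero diagonal.
    """
    w_expr, w_chip, w_ppi = weights
    d_expr = pearson_dissimilarity(expr)
    d_chip = pearson_dissimilarity(chip)
    d_ppi = 1.0 - np.asarray(ppi_adj, dtype=float)
    np.fill_diagonal(d_ppi, 0.0)  # a gene is at distance 0 from itself
    return w_expr * d_expr + w_chip * d_chip + w_ppi * d_ppi


# Toy data: 3 genes, 3 conditions; genes 0 and 1 interact.
expr = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]])
chip = expr.copy()
ppi_adj = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])

D = combined_dissimilarity(expr, chip, ppi_adj)
```

In this toy case genes 0 and 1 are perfectly correlated and interact, so their combined dissimilarity is close to zero, while gene 2 is anti-correlated with both and lies far from them. Raising a weight makes the corresponding data source dominate the clustering, which is the role the abstract assigns to the tuning coefficients.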

Cite this paper
Li, G. and Wang, Z. (2009) Incorporating heterogeneous biological data sources in clustering gene expression data. Health, 1, 17-23. doi: 10.4236/health.2009.11004.