ABSTRACT Microarray gene expression measurements are usually reported, used and archived to high numerical precision. However, properties of mRNA molecules, such as their low stability and availability in small copy numbers, and the fact that measurements correspond to a population of cells rather than a single cell, make high precision meaningless. Recent work shows that reducing measurement precision leads to very little loss of information, right down to binary levels. In this paper we show how properties of binary spaces can be useful in making inferences from microarray data. In particular, we use the Tanimoto similarity metric for binary vectors, which has been used effectively in the chemoinformatics literature for retrieving chemical compounds with certain functional properties. This measure, when incorporated in a kernel framework, helps recover any information lost by quantization. By implementing a spectral clustering framework, we further show that a second reason for the high performance of the Tanimoto metric can be traced back to a hitherto unnoticed systematic variability in array data: probe-level uncertainties are systematically lower for arrays with large numbers of expressed genes. While we offer no molecular-level explanation for this systematic variability, the fact that it can be exploited in a suitable similarity metric is a useful observation in itself. We further show preliminary results indicating that working with binary data considerably reduces the variability in results across the choice of algorithms in the preprocessing stages of microarray analysis.
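As an illustration of the similarity measure named in the abstract, the following is a minimal sketch of the Tanimoto coefficient on binarized expression profiles, T(a, b) = |a AND b| / |a OR b|. It is not the authors' implementation. The function names `binarize`, `tanimoto` and `tanimoto_kernel`, the fixed-threshold binarization, and the convention of returning 1.0 for two all-zero profiles are all assumptions made for this example.

```python
def binarize(values, threshold):
    """Map expression values to 0/1 (hypothetical fixed-threshold rule)."""
    return [1 if v > threshold else 0 for v in values]


def tanimoto(a, b):
    """Tanimoto similarity between two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Genes expressed in both profiles.
    both = sum(1 for x, y in zip(a, b) if x and y)
    # Genes expressed in at least one profile.
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two all-zero profiles are treated as identical.
    return both / either if either else 1.0


def tanimoto_kernel(samples):
    """Pairwise Tanimoto similarity matrix for a list of binary profiles."""
    return [[tanimoto(a, b) for b in samples] for a in samples]


if __name__ == "__main__":
    # Toy expression values for three arrays and four genes (made-up numbers).
    arrays = [[7.2, 6.8, 1.1, 0.9], [6.5, 1.3, 7.0, 0.4], [0.2, 0.1, 0.3, 0.5]]
    binary = [binarize(row, threshold=3.0) for row in arrays]
    for row in tanimoto_kernel(binary):
        print(row)
```

A matrix of this kind could then be passed to a kernel classifier or a spectral clustering routine that accepts a precomputed similarity matrix. Note that the coefficient depends on how many genes are "on" in each profile, which is the property the abstract links to the systematic variability across arrays.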
Cite this paper
Tuna, S. and Niranjan, M. (2009) Classification with binary gene expressions. Journal of Biomedical Science and Engineering, 2, 390-399. doi: 10.4236/jbise.2009.26056.