A new projection method for biological semantic map generation

ABSTRACT

Low-dimensional representation is a convenient way to obtain a synthetic view of complex datasets and has long been used in various domains. When the representation relates to words in documents, it is also called a semantic map. The two most popular methods are self-organizing maps and generative topographic mapping. The latter is statistically well-founded but far less computationally efficient than the former. A drawback of self-organizing maps, on the other hand, is that they do not project every data point, only the map nodes. This paper presents a method for obtaining projections of all data points, complementing the self-organizing map nodes. The idea is to project points so that their original distances to a set of cluster centers are preserved as closely as possible. The method is tested on an oil flow dataset and then applied to a large protein sequence dataset described by keywords. It has been integrated into an interactive data browser for biological databases.
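
The distance-preserving idea can be illustrated with a small sketch. It is not the authors' algorithm: it assumes the map coordinates of the cluster centers are already known (for instance, taken from self-organizing map nodes) and uses plain gradient descent on a squared-error criterion chosen here for illustration. The names `project_point`, `dists_to_centers`, and `center_positions` are hypothetical.

```python
import math

def project_point(dists_to_centers, center_positions, steps=500, rate=0.25):
    """Place one point on a 2D map so that its map distances to the cluster
    centers match its original-space distances as closely as possible.

    Minimizes sum_k (||p - c_k|| - d_k)^2 by gradient descent (an assumed,
    illustrative criterion, not the one from the paper).
    """
    k = len(center_positions)
    # Start from the centroid of the centers, weighted by inverse distance,
    # so that nearby centers pull the starting position toward them.
    weights = [1.0 / (d + 1e-9) for d in dists_to_centers]
    total = sum(weights)
    x = sum(w * cx for w, (cx, _) in zip(weights, center_positions)) / total
    y = sum(w * cy for w, (_, cy) in zip(weights, center_positions)) / total

    step = rate / k  # scale the step by the number of centers for stability
    for _ in range(steps):
        gx = gy = 0.0
        for d, (cx, cy) in zip(dists_to_centers, center_positions):
            dx, dy = x - cx, y - cy
            r = math.hypot(dx, dy)
            if r < 1e-12:
                continue  # gradient is undefined exactly at a center
            coef = 2.0 * (r - d) / r
            gx += coef * dx
            gy += coef * dy
        x -= step * gx
        y -= step * gy
    return x, y

# Usage: three centers on the map and the original-space distances of one point.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dists = [0.5, math.sqrt(0.65), math.sqrt(0.45)]
px, py = project_point(dists, centers)
```

In this toy case the distances can be matched exactly on the map. With real high-dimensional data they generally cannot, and the procedure returns a compromise position that depends on the starting point.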

Cite this paper

Nguyen, H., Wicker, N., Kieffer, D. and Poch, O. (2010) A new projection method for biological semantic map generation. *Journal of Biomedical Science and Engineering*, **3**, 13-19. doi: 10.4236/jbise.2010.31002.
