JDAIP Vol.3 No.4, November 2015
Identifying Semantic in High-Dimensional Web Data Using Latent Semantic Manifold
Abstract: Latent Semantic Analysis is a natural language processing technique that analyzes relationships between a set of documents and the terms they contain by producing a set of concepts, called semantic topics, related to those documents and terms. These semantic topics assist search engine users by providing leads to more relevant documents. We develop a novel algorithm called Latent Semantic Manifold (LSM) that can identify the semantic topics in high-dimensional web data. The LSM algorithm is built upon concepts from topology and probability. A search tool was also developed using the LSM algorithm. This search tool was deployed for two years at two sites in Taiwan: 1) Taipei Medical University Library, Taipei, and 2) Biomedical Engineering Laboratory, Institute of Biomedical Engineering, National Taiwan University, Taipei. We evaluate the effectiveness and efficiency of the LSM algorithm by comparing it with other contemporary algorithms. The results show that the LSM algorithm outperforms the algorithms it was compared with. This algorithm can be used to enhance the functionality of currently available search engines.
Cite this paper: Kumar, A., Maskara, S. and Chiang, I. (2015) Identifying Semantic in High-Dimensional Web Data Using Latent Semantic Manifold. Journal of Data Analysis and Information Processing, 3, 136-152. doi: 10.4236/jdaip.2015.34014.
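As background for the abstract's description of Latent Semantic Analysis, the following is a minimal sketch of classical LSA using a truncated singular value decomposition of a term-document matrix. It is not the LSM algorithm proposed in the paper, which the abstract describes only at a high level. The function names (`lsa_topics`, `cosine`), the toy vocabulary, and the counts are all hypothetical and chosen purely for illustration.

```python
import numpy as np

def lsa_topics(term_doc, k):
    """Project terms and documents onto k latent topics via truncated SVD.

    term_doc: (n_terms, n_docs) matrix of term counts (an assumed input format).
    Returns (term_coords, doc_coords), each with k columns.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    term_coords = U[:, :k] * s[:k]          # one row per term
    doc_coords = (s[:k, None] * Vt[:k]).T   # one row per document
    return term_coords, doc_coords

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy corpus: rows are terms, columns are documents.
# Documents 0 and 1 use biomedical terms; documents 2 and 3 use library terms.
terms = ["gene", "protein", "cell", "library", "search", "catalog"]
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
], dtype=float)

_, docs = lsa_topics(counts, k=2)
same_topic = cosine(docs[0], docs[1])    # two biomedical documents
cross_topic = cosine(docs[0], docs[2])   # biomedical vs. library document
print(f"same topic: {same_topic:.3f}, cross topic: {cross_topic:.3f}")
```

In this toy setup, documents that share vocabulary end up close together in the two-dimensional topic space, while documents from the other vocabulary group are nearly orthogonal to them. Real web data would typically need term weighting (for example TF-IDF) and a sparse solver, both of which are omitted here.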
