JILSA, Vol. 2 No. 3, August 2010
A SOM-Based Document Clustering Using Frequent Max Substrings for Non-Segmented Texts
Abstract: This paper proposes a clustering method for non-segmented documents that combines a self-organizing map (SOM) with the frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for document clustering and has been successful in many applications. However, when it is applied to non-segmented documents, the challenge is to identify interesting patterns efficiently. The proposed method has two main phases: a preprocessing phase and a clustering phase. In the preprocessing phase, the frequent max substring technique is applied to the non-segmented texts to discover the patterns of interest, called Frequent Max substrings, which are long and frequent substrings rather than individual words. These discovered patterns are then used as indexing terms. The indexing terms, together with their numbers of occurrences, form a document vector. In the clustering phase, SOM generates the document cluster map from the feature vectors of Frequent Max substrings. To demonstrate the proposed technique, the paper presents experimental studies and comparison results on clustering Thai text documents, which consist of non-segmented texts. The results show that the proposed technique can be used for Thai texts. The document cluster map generated with the method can be used to find relevant documents more efficiently.
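The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how Frequent Max substrings are mined, so the sketch assumes the indexing terms are already given, and the names `count_occurrences`, `document_vector`, `best_matching_unit` and `train_som`, the one-dimensional map layout, and all parameter values are hypothetical choices made for illustration only.

```python
import random


def count_occurrences(text, term):
    """Count occurrences of term in text, allowing overlaps."""
    count, start = 0, text.find(term)
    while start != -1:
        count += 1
        start = text.find(term, start + 1)
    return count


def document_vector(text, index_terms):
    """Preprocessing phase (assumed form): one occurrence count per indexing term."""
    return [count_occurrences(text, term) for term in index_terms]


def best_matching_unit(weights, vector):
    """Return the index of the map node closest to vector (squared Euclidean)."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - x) ** 2 for w, x in zip(weights[i], vector)),
    )


def train_som(vectors, n_nodes=2, epochs=50, lr=0.5, seed=0):
    """Clustering phase (assumed form): train a small 1-D SOM.

    The learning rate and neighbourhood radius both shrink linearly to zero.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1 - epoch / epochs
        radius = (n_nodes / 2) * frac
        rate = lr * frac
        for v in vectors:
            bmu = best_matching_unit(weights, v)
            for i in range(n_nodes):
                # Nodes within the current radius of the BMU move toward v.
                if abs(i - bmu) <= radius:
                    weights[i] = [w + rate * (x - w) for w, x in zip(weights[i], v)]
    return weights


# Hypothetical non-segmented strings and indexing terms (Latin stand-ins).
index_terms = ["abc", "xyz"]
docs = ["abcabcabc", "abcabc", "xyzxyzxyz", "xyzxyz"]
vectors = [document_vector(d, index_terms) for d in docs]
weights = train_som(vectors)
clusters = [best_matching_unit(weights, v) for v in vectors]
```

Each entry of `clusters` is the map node assigned to the corresponding document; documents that share a node would be treated as one cluster on the map.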
Cite this paper: T. Chumwatana, K. Wong and H. Xie, "A SOM-Based Document Clustering Using Frequent Max Substrings for Non-Segmented Texts," Journal of Intelligent Learning Systems and Applications, Vol. 2 No. 3, 2010, pp. 117-125. doi: 10.4236/jilsa.2010.23015.
