JCC  Vol.2 No.11 , September 2014
Parallel K-Means Algorithm for Shared Memory Multiprocessors
Author(s) Tayfun Kucukyilmaz*
Abstract

Clustering is the task of assigning a set of instances into groups in such a way that is dissimilarity of instances within each group is minimized. Clustering is widely used in several areas such as data mining, pattern recognition, machine learning, image processing, computer vision and etc. K-means is a popular clustering algorithm which partitions instances into a fixed number clusters in an iterative fashion. Although k-means is considered to be a poor clustering algorithm in terms of result quality, due to its simplicity, speed on practical applications, and iterative nature it is selected as one of the top 10 algorithms in data mining [1]. Parallelization of k-means is also studied during the last 2 decades. Most of these work concentrate on shared-nothing architectures. With the advent of current technological advances on GPU technology, implementation of the k-means algorithm on shared memory architectures recently start to attract some attention. However, to the best of our knowledge, no in-depth analysis on the performance of k-means on shared memory multiprocessors is done in the literature. In this work, our aim is to fill this gap by providing theoretical analysis on the performance of k-means algorithm and presenting extensive tests on a shared memory architecture.


Cite this paper
Kucukyilmaz, T. (2014) Parallel K-Means Algorithm for Shared Memory Multiprocessors. Journal of Computer and Communications, 2, 15-23. doi: 10.4236/jcc.2014.211002.
References

[1]   Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.-H., Steinbach, M., Hand, D.J. and Steinberg, D. (2007) Top 10 Algorithms in Data Mining. Knowl. Inf. Syst., 14, 37.

[2]   Witten, I.H., Frank, E. and Hall, M. (2011) Data Mining: Practical Machine Learning Tools and Techniques. 3rd Edition, Morgan Kaufmann, ISBN: 9780123748560.

[3]   Cox, D.R. (1957) Note on Grouping. Journal of Amer. Statist. Assoc., 52, 543-547.

[4]   McQueen, J. (1997) Some Methods for Classification and Analysis of Multivariate Observations. In: Proc. of 5th Berkeley Symp. on Mathematical Statistics and Probability, 173-188.

[5]   Hartigan, J.A. (1975) Clustering Algorithms. Wiley Publishing Ltd., Chichester.

[6]   Dhillon, S. and Modha, D.S. (2000) A Data Clustering Algorithm on Distributed Memory Multiprocessors. In: Know. Data Discovery (KDD), Lecture Notes in Computer Science, 1759, 245-260.

[7]   Kantrabutra, S. and Couch, A.L. (2000) Parallel k-Means Algorithm on NOWs. Medtec Technical Journal, 1, 243- 249.

[8]   Das, A., Datar, M., Garg, A. and Rajaram, S. (2007) Google News Personalization: Scalable Online Collaborative Filtering. In WWW, 271-280.

[9]   Ene, A., Im, S. and Moseley, B. (2011) Fast Clustering using MapReduce. In: Know. Data Discovery (KDD), 681-689.

[10]   Jin, R. and Agrawal, G. (2003) Combining Distributed Memory and Shared Memory Parallelization for Data Mining Algorithms. In: HPDM: 6th International High Performance, Pervasive, and Data Stream Mining.

[11]   Chinrungrueng, C. and Sequin, C.H. (1995) Optimal Adaptive k-Means Algorithm with Dynamic Adjustment of Learning Rate. IEEE Transactions on Neural Networks, 6, 157-169. http://dx.doi.org/10.1109/72.363440

[12]   Stoffel, K. and Belkoniene, A. (1999) Parallel k/h-Means Clustering for Large Data Sets. In Euro-Par ’99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing, 1451-1454.

[13]   Bahmani, B., Moseley, B., Vattani, A., Kumar, R. and Vasilvitskii, S. (2012) Scalable K-Means++. In: Proceedings of the VLDB Endowment, 5, 622-633.

[14]   Gursoy, A. and Cengiz, I. (2001) Parallel Pruning for K-Means Clustering on Shared Memory Architectures. Lecture Notes in Computer Science, 2150, 321-325. http://dx.doi.org/10.1007/3-540-44681-8_45

[15]   Kumar, J., Mills, R.T., Hoffman, F.M. and Hargrove, W.W. (2011) Parallel k-Means Clustering for Quantitative Ecoregion Delineation Using Large Data Sets. Proceedings of the International Conference on Computational Science, ICCS, 4, 1602-1611.

 
 
Top