LeaDen-Stream: A Leader Density-Based Clustering Algorithm over Evolving Data Stream

Abstract

Clustering evolving data streams is important to be performed in a limited time with a reasonable quality. The existing micro clustering based methods do not consider the distribution of data points inside the micro cluster. We propose LeaDen-Stream (Leader Density-based clustering algorithm over evolving data Stream), a density-based clustering algorithm using leader clustering. The algorithm is based on a two-phase clustering. The online phase selects the proper mini-micro or micro-cluster leaders based on the distribution of data points in the micro clusters. Then, the leader centers are sent to the offline phase to form final clusters. In LeaDen-Stream, by carefully choosing between two kinds of micro leaders, we decrease time complexity of the clustering while maintaining the cluster quality. A pruning strategy is also used to filter out real data from noise by introducing dense and sparse mini-micro and micro-cluster leaders. Our performance study over a number of real and synthetic data sets demonstrates the effectiveness and efficiency of our method.

Cite this paper

Amini, A. and Wah, T. (2013) LeaDen-Stream: A Leader Density-Based Clustering Algorithm over Evolving Data Stream.*Journal of Computer and Communications*, **1**, 26-31. doi: 10.4236/jcc.2013.15005.

Amini, A. and Wah, T. (2013) LeaDen-Stream: A Leader Density-Based Clustering Algorithm over Evolving Data Stream.

References

[1] J. Han, M. Kamber and J. Pei, “Data Mining: Concepts and Techniques,” 3rd Edition, Morgan Kaufmann Publishers Inc., San Francisco, 2011.

[2] L. OCállaghan, A. O. Meyerson, N. Mishra and S. Guha, “Streaming Data Algorithms for High-Quality Clustering,” International Conference on Data Engineering, Los Alamitos, IEEE Computer Society, 2002, pp. 685-694.

[3] S. Guha, A. Meyerson, N. Mishra, R. Motwani and L. O’Callaghan, “Clustering Data Streams: Theory and Practice,” IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 3, 2003, pp. 515-528.
http://dx.doi.org/10.1109/TKDE.2003.1198387

[4] D. Barbará, “Requirements for Clustering Data Streams,” SIGKDD Explor. Newsl.3, 2002, pp. 23-27.
http://dx.doi.org/10.1145/507515.507519

[5] C. C. Aggarwal, J. Han, J. Wang and P. S. Yu, “A Framework for Clustering Evolving Data Streams,” Proceedings of the 29th International Conference on Very Large Data Bases, VLDB Endowment, 2003, pp. 81-92.

[6] C. C. Aggarwal, “Data Streams—Models and Algorithms,” Springer, 2007.
http://dx.doi.org/10.1007/978-0-387-47534-9

[7] S. Guha, N. Mishra, R. Motwani and L. O’Callaghan, “Clustering Data Streams,” Proceedings of the 41st Annual Symposium on Foundations of Computer Science, Washington DC, IEEE Computer Society, 2000, p. 359.
http://dx.doi.org/10.1109/SFCS.2000.892124

[8] P. Kranen, I. Assent, C. Baldauf and T. Seidl, “The Clustree: Indexing Micro-Clusters for Anytime Stream Mining,” Knowledge Information System, Vol. 29, No. 2, 2011, pp. 249-272.
http://dx.doi.org/10.1007/s10115-010-0342-8

[9] L. Wan, W. K. Ng, X. H. Dang, P. S. Yu and K. Zhang, “Density-Based Clustering of Data Streams at Multiple Resolutions,” ACM Transactions Knowledge Discovery Data, Vol. 3, No. 3, 2009, pp. 1-28.
http://dx.doi.org/10.1145/1552303.1552307

[10] M. Ester, H. P. Kriegel, J. Sander and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Data-bases with Noise,” Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD), Portland, Oregon, AAAI Press, 1996, pp. 226-231.

[11] M. Ankerst, M. M. Breunig, H. P. Kriegel and J. Sander, “Optics: Ordering Points to Identify the Clustering Structure,” SIGMOD Record, Vol. 28, 1999, pp. 49-60.
http://dx.doi.org/10.1145/304181.304187

[12] A. Hinneburg and D. A. Keim, “An Efficient Approach to Clustering in Large Multimedia Databases with Noise,” KDD, 1998, pp. 58-65.

[13] A. Amini, W. Teh Ying, M. R. Saybani and S. R. Aghabozorgi, “A Study of Density-Grid Based Clustering Algorithms on Data Streams,” 8th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD11), Shanghai, IEEE, 2011, pp. 1652-1656.

[14] T. Zhang, R. Ramakrishnan and M. Livny, “BIRCH: An Efficient Data Clustering Method for Very Large Data-bases,” In: J. Widom, Ed., Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, ACM Press, 1996, pp. 103-114.

[15] C. C. Aggarwal, J. Han, J. Wang and P. S. Yu, “A Framework for Projected Clustering of High Dimensional Data Streams,” Proceedings of the Thirtieth International Conference on Very Large Data Bases, Vol. 30, VLDB, 2004, pp. 852-863.

[16] C. C. Aggarwal, J. Han, J. Wang and P. S. Yu, “On High Dimensional Projected Clustering of Data Streams,” Data Mining and Knowledge Discovery, Vol. 10, 2005, pp. 251-273. http://dx.doi.org/10.1007/s10618-005-0645-7

[17] F. Cao, M. Ester, W. Qian and A. Zhou, “Density-Based Clustering over an Evolving Data Stream with Noise,” SIAM Conference on Data Mining, 2006, pp. 328-339.

[18] A. Forestiero, C. Pizzuti and G. Spezzano, “A Single Pass Algorithm for Clustering Evolving Data Streams Based on Swarm Intelligence. Data Mining and Knowledge Discovery,” 2011.

[19] A. Amini and W. Teh Ying, “A Comparative Study of Density-Based Clustering Algorithms on Data Streams: Micro-Clustering Approaches,” In: S. I. Ao, O. Castillo and X. Huang, Eds., Intelligent Control and Innovative Computing, Vol. 110 of Lecture Notes in Electrical Engineering, Springer, 2012, pp. 275-287.

[20] A. Amini and W. Teh Ying, “Density Micro-Clustering Algorithms on Data Streams: A Review,” International Conference on Data Mining and Applications (ICDMA), Hong Kong, 2011, pp. 410-414.

[21] P. Viswanath and R. Pinkesh, “l-dbscan: A Fast Hybrid Density Based Clustering Method,” Proceedings of the 18th International Conference on Pattern Recognition, Vol. 01, ICPR’06, Washington DC, IEEE Computer Society, 2006, pp. 912-915.

[22] P. Viswanath and V. Suresh Babu, “Rough-Dbscan: A Fast Hybrid Density Based Clustering Method for Large Data Sets,” Pattern Recognition Letters, Vol. 30, No. 16, 2009, pp. 1477-1488.
http://dx.doi.org/10.1016/j.patrec.2009.08.008

[23] J. Gao, J. Li, Z. Zhang and P. N. Tan, “An Incremental Data Stream Clustering Algorithm Based on Dense Units Detection. Lecture Notes in Computer Science 3518,” 2005.

[24] Y. Chen and L. Tu, “Density-Based Clustering for Real-Time Stream Data,” Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, New York, ACM, 2007, pp. 133-142.
http://dx.doi.org/10.1145/1281192.1281210

[25] A. Zhou, F. Cao, W. Qian and C. Jin, “Tracking Clusters in Evolving Data Streams over Sliding Windows,” Knowledge and Information Systems, Vol. 15, 2008, pp. 181- 214. http://dx.doi.org/10.1007/s10115-007-0070-x

[26] A. Amini and W. Teh Ying, “DENGRIS-Stream: A Density-Grid Based Clustering Algorithm for Evolving Data Streams over Sliding Window,” International Conference on Data Mining and Computer Engineering (ICDMCE), Bangkok, 2012, pp. 206-210.

[27] L. Tu and Y. Chen, “Stream Data Clustering Based on Grid Density and Attraction,” ACM Transactions on Knowledge Discovery Data, Vol. 3, No. 3, 2009, pp. 1-27.
http://dx.doi.org/10.1145/1552303.1552305

[28] A. Amini and W. Teh Ying, “On Density-Based Clustering Algorithms over Evolving Data Streams: A Summarization Paradigm,” International Conference on Information Technology and Management Innovation (ICITMI), Guangzhou, 2012, pp. 2234-2237.

[29] W. Ng and M. Dash, “Discovery of Frequent Patterns in Transactional Data Streams,” Transactions on Large-Scale Data-and Knowledge-Centered Systems II, Vol. 6380 of Lecture Notes in Computer Science, Springer, Berlin/ Heidelberg, 2010, pp. 1-30.

[30] A. Bifet, G. Holmes, B. Pfahringer, P. Kranen, H. Kremer, T. Jansen and T. Seidl, “Moa: Massive Online Analysis, a Framework for Stream Classification and Clustering,” Journal of Machine Learning Research (JMLR), Vol. 11, 2010, pp. 44-50.

[31] Y. Zhao and G. Karypis, “Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering,” Machine Learning, Vol. 55, 2004, pp. 311-333.
http://dx.doi.org/10.1023/B:MACH.0000027785.44527.d6