Scalable Varied Density Clustering Algorithm for Large Datasets

ABSTRACT

Finding clusters in data is a challenging problem especially when the clusters are being of widely varied shapes, sizes, and densities. Herein a new scalable clustering technique which addresses all these issues is proposed. In data mining, the purpose of data clustering is to identify useful patterns in the underlying dataset. Within the last several years, many clustering algorithms have been proposed in this area of research. Among all these proposed methods, density clustering methods are the most important due to their high ability to detect arbitrary shaped clusters. Moreover these methods often show good noise-handling capabilities, where clusters are defined as regions of typical densities separated by low or no density regions. In this paper, we aim at enhancing the well-known algorithm DBSCAN, to make it scalable and able to discover clusters from uneven datasets in which clusters are regions of homogenous densities. We achieved the scalability of the proposed algorithm by using the k-means algorithm to get initial partition of the dataset, applying the enhanced DBSCAN on each partition, and then using a merging process to get the actual natural number of clusters in the underlying dataset. This means the proposed algorithm consists of three stages. Experimental results using synthetic datasets show that the proposed clustering algorithm is faster and more scalable than the enhanced DBSCAN counterpart.

Finding clusters in data is a challenging problem especially when the clusters are being of widely varied shapes, sizes, and densities. Herein a new scalable clustering technique which addresses all these issues is proposed. In data mining, the purpose of data clustering is to identify useful patterns in the underlying dataset. Within the last several years, many clustering algorithms have been proposed in this area of research. Among all these proposed methods, density clustering methods are the most important due to their high ability to detect arbitrary shaped clusters. Moreover these methods often show good noise-handling capabilities, where clusters are defined as regions of typical densities separated by low or no density regions. In this paper, we aim at enhancing the well-known algorithm DBSCAN, to make it scalable and able to discover clusters from uneven datasets in which clusters are regions of homogenous densities. We achieved the scalability of the proposed algorithm by using the k-means algorithm to get initial partition of the dataset, applying the enhanced DBSCAN on each partition, and then using a merging process to get the actual natural number of clusters in the underlying dataset. This means the proposed algorithm consists of three stages. Experimental results using synthetic datasets show that the proposed clustering algorithm is faster and more scalable than the enhanced DBSCAN counterpart.

Cite this paper

nullA. Fahim, A. Salem, F. Torkey, M. Ramadan and G. Saake, "Scalable Varied Density Clustering Algorithm for Large Datasets,"*Journal of Software Engineering and Applications*, Vol. 3 No. 6, 2010, pp. 593-602. doi: 10.4236/jsea.2010.36069.

nullA. Fahim, A. Salem, F. Torkey, M. Ramadan and G. Saake, "Scalable Varied Density Clustering Algorithm for Large Datasets,"

References

[1] J. MacLennan, Z. Tang and B. Crivat, “Data Mining with SQL Server 2008,” Wiley Publishing, Indiana, 2009.

[2] M. Ester, H. P. Kriegel, J. Sander and X. Xu, “A Density Based Algorithm for Discovering Clusters in large Spatial Datasets with Noise,” Proceedings of International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226-231.

[3] A. Hinneburg and D. Keim, “An Efficient Approach to Clustering in Large Multimedia databases with Noise,” Proceedings International Conference on Knowledge Dis- covery and Data Mining, 1998, pp. 58-65.

[4] M. Ankerst, M. Breunig, H. P. Kriegel and J. Sandler, “OPTICS: Ordering Points to Identify the Clustering Structure,” Proceedings of the International Conference on Management of Data (SIGMOD’99), 1999, pp. 49-60.

[5] A. Fahim, G. Saake, A. Salem, F. Torkey and M. Rama- dan, “Enhanced Density Based Spatial clustering of Application with Noise,” in Proceedings of the 2009 Inter- national Conference on Data Mining, Las Vegas, July 2009, pp. 517-523.

[6] C.-F. Tsai and C.-W. Liu, “KIDBSCAN: A New Efficient Data Clustering Algorithm,” Artificial Intelligence and Soft Computing-ICAISC, Springer, Berlin/Heidelberg, 2006, pp. 702-711.

[7] R. Xin and C. H. Duo, “An Improved Clustering Algo- rithm,” International Symposium on Computational Intell- igence and Design, 2008, pp. 394-397.

[8] Y. El-Sonbaty, M. Ismail and M. Farouk, “An Efficient Density Based Clustering Algorithm for Large Data- bases,” Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2004, pp. 673-677.

[9] A. Fahim, A. Salem, F. Torkey and M. Ramadan, “An Efficient Enhanced k-Means Clustering Algorithm,” Journal of Zhejiang University Science A, Vol. 7, No. 10, 2006, pp. 1626-1633.

[10] S. Guha, R. Rastogi and K. Shim, “CURE: An Efficient Clustering Algorithms for Large Databases,” Procee- dings of ACM SIGMOD International Conference on Management of Data, Seattle, 1998, pp. 73-84.

[11] A. K. Jain, M. N. Murty and P. J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, Vol. 31, No. 3, September 1999, pp. 264-323.

[12] L. Ertoz, M. Steinbach and V. Kumar, “A New Shared Nearest Neighbor Clustering Algorithm and its Appli- cations,” Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Con- ference on Data Mining, 2002.

[13] M. Emre Celebi, Y. Alp Aslandogan and P. R. Bergs- tresser, “Mining Biomedical Images with Density-Based Clustering,” Proceedings of the International Confer- ence on Information Technology: Coding and Computing, Washington, DC, IEEE Computer Society, Vol. 1, 2005, pp. 163-168.

[14] J. Sander, M. Ester, H.-P. Kriegel and X. Xu “Density- Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications,” Data Mining and Knowledge Discovery, Vol. 2, No. 2, 1998, pp. 169-194.

[1] J. MacLennan, Z. Tang and B. Crivat, “Data Mining with SQL Server 2008,” Wiley Publishing, Indiana, 2009.

[2] M. Ester, H. P. Kriegel, J. Sander and X. Xu, “A Density Based Algorithm for Discovering Clusters in large Spatial Datasets with Noise,” Proceedings of International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226-231.

[3] A. Hinneburg and D. Keim, “An Efficient Approach to Clustering in Large Multimedia databases with Noise,” Proceedings International Conference on Knowledge Dis- covery and Data Mining, 1998, pp. 58-65.

[4] M. Ankerst, M. Breunig, H. P. Kriegel and J. Sandler, “OPTICS: Ordering Points to Identify the Clustering Structure,” Proceedings of the International Conference on Management of Data (SIGMOD’99), 1999, pp. 49-60.

[5] A. Fahim, G. Saake, A. Salem, F. Torkey and M. Rama- dan, “Enhanced Density Based Spatial clustering of Application with Noise,” in Proceedings of the 2009 Inter- national Conference on Data Mining, Las Vegas, July 2009, pp. 517-523.

[6] C.-F. Tsai and C.-W. Liu, “KIDBSCAN: A New Efficient Data Clustering Algorithm,” Artificial Intelligence and Soft Computing-ICAISC, Springer, Berlin/Heidelberg, 2006, pp. 702-711.

[7] R. Xin and C. H. Duo, “An Improved Clustering Algo- rithm,” International Symposium on Computational Intell- igence and Design, 2008, pp. 394-397.

[8] Y. El-Sonbaty, M. Ismail and M. Farouk, “An Efficient Density Based Clustering Algorithm for Large Data- bases,” Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2004, pp. 673-677.

[9] A. Fahim, A. Salem, F. Torkey and M. Ramadan, “An Efficient Enhanced k-Means Clustering Algorithm,” Journal of Zhejiang University Science A, Vol. 7, No. 10, 2006, pp. 1626-1633.

[10] S. Guha, R. Rastogi and K. Shim, “CURE: An Efficient Clustering Algorithms for Large Databases,” Procee- dings of ACM SIGMOD International Conference on Management of Data, Seattle, 1998, pp. 73-84.

[11] A. K. Jain, M. N. Murty and P. J. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, Vol. 31, No. 3, September 1999, pp. 264-323.

[12] L. Ertoz, M. Steinbach and V. Kumar, “A New Shared Nearest Neighbor Clustering Algorithm and its Appli- cations,” Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Con- ference on Data Mining, 2002.

[13] M. Emre Celebi, Y. Alp Aslandogan and P. R. Bergs- tresser, “Mining Biomedical Images with Density-Based Clustering,” Proceedings of the International Confer- ence on Information Technology: Coding and Computing, Washington, DC, IEEE Computer Society, Vol. 1, 2005, pp. 163-168.

[14] J. Sander, M. Ester, H.-P. Kriegel and X. Xu “Density- Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications,” Data Mining and Knowledge Discovery, Vol. 2, No. 2, 1998, pp. 169-194.