JSEA Vol. 7 No. 8, July 2014
D-IMPACT: A Data Preprocessing Algorithm to Improve the Performance of Clustering
ABSTRACT

In this study, we propose D-IMPACT, a data preprocessing algorithm inspired by the IMPACT clustering algorithm. D-IMPACT iteratively moves data points based on attraction and density in order to detect and remove noise and outliers and to separate clusters. Our experimental results on two-dimensional datasets and practical datasets show that the datasets produced by D-IMPACT improve the performance of the clustering algorithm applied to them.
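
To make the idea of moving points by attraction and density more concrete, the following Python sketch shows one simple way such a preprocessing step could look. It is not the D-IMPACT procedure itself: the neighborhood `radius`, the use of a neighbor count as density, the density-weighted mean as the attraction target, the `min_density` noise threshold, and all function and parameter names are assumptions made for this illustration.

```python
import math


def density(points, radius):
    """Density of each point = number of other points within `radius` (assumed measure)."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]


def shrink_step(points, radius, step=0.5):
    """Move every point a fraction `step` toward the density-weighted mean of its neighbors."""
    dens = density(points, radius)
    moved = []
    for i, p in enumerate(points):
        neighbors = [
            j for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius
        ]
        total = sum(dens[j] for j in neighbors)
        if total == 0:
            # Isolated point: nothing attracts it, so it stays where it is.
            moved.append(tuple(p))
            continue
        # Denser neighbors attract more strongly (assumed attraction rule).
        target = [
            sum(dens[j] * points[j][k] for j in neighbors) / total
            for k in range(len(p))
        ]
        moved.append(tuple(x + step * (t - x) for x, t in zip(p, target)))
    return moved


def preprocess(points, radius, min_density=1, iterations=3, step=0.5):
    """Drop low-density points as noise, then shrink the remaining points iteratively."""
    dens = density(points, radius)
    kept = [tuple(p) for p, d in zip(points, dens) if d >= min_density]
    for _ in range(iterations):
        kept = shrink_step(kept, radius, step)
    return kept
```

Calling `preprocess(points, radius=1.5)` on a list of coordinate tuples returns a new list in which isolated points have been dropped and the surviving points have been pulled toward their dense neighbors. The output can then be passed to any clustering algorithm in place of the original data.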


Cite this paper
Tran, V., Hirose, O., Saethang, T., Nguyen, L., Dang, X., Le, T., Ngo, D., Sergey, G., Kubo, M., Yamada, Y. and Satou, K. (2014) D-IMPACT: A Data Preprocessing Algorithm to Improve the Performance of Clustering. Journal of Software Engineering and Applications, 7, 639-654. doi: 10.4236/jsea.2014.78059.
