JILSA  Vol.7 No.4 , November 2015
A KNN Undersampling Approach for Data Balancing
Abstract: In supervised learning, the imbalanced number of instances among the classes in a dataset can make the algorithms to classify one instance from the minority class as one from the majority class. With the aim to solve this problem, the KNN algorithm provides a basis to other balancing methods. These balancing methods are revisited in this work, and a new and simple approach of KNN undersampling is proposed. The experiments demonstrated that the KNN undersampling method outperformed other sampling methods. The proposed method also outperformed the results of other studies, and indicates that the simplicity of KNN can be used as a base for efficient algorithms in machine learning and knowledge discovery.
Cite this paper: Beckmann, M. , Ebecken, N. and Pires de Lima, B. (2015) A KNN Undersampling Approach for Data Balancing. Journal of Intelligent Learning Systems and Applications, 7, 104-116. doi: 10.4236/jilsa.2015.74010.

[1]   Weiss, G.M. and Provost, F. (2001) The Effect of Class Distribution on Classifier Learning: An Empirical Study. Technical Report MLTR-43, Department of Computer Science, Rutgers University, New Brunswick, NJ, USA.

[2]   He, H. and Ma, Y. (2013) Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley-IEEE Press, Hoboken, NJ, USA.

[3]   Japkowicz, N. (2003) Class Imbalances. Are We Focusing on the Right Issue? Proceedings of the ICML’2003, Workshop on Learning from Imbalanced Data Sets II, Washington DC.

[4]   Qiong, G., Cai, Z., Zhu, L. and Huang, B. (2008) Data Mining on Imbalanced Data Sets. International Conference on Advanced Computer Theory and Engineering, Phuket, 20-22 December, 1020-1024.

[5]   Barandela, R., Sánchez, J.S., García, V. and Rangel, E. (2003) Strategies for Learning in Class Imbalance Problems. Pattern Recognition, 36, 849-851.

[6]   Fawcett, T. (2004) ROC Graphs: Notes and Practical Considerations for Researchers, HP Laboratories.

[7]   Chawla, N., Bowyer, K., Hall, L. and Kegelmeyer, W.P. (2002) SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.

[8]   Wilson, D.L. (1972) Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Communications, 2, 408-421.

[9]   Laurikkala, J. (2001) Improving Identification of Difficult Small Classes by Balancing Class Distribution. Technical Report A-2001-2, University of Tampere, Tampere, Finland.

[10]   Beckmann, M., De Lima, B.S.L.P. and Ebecken, N.F.F. (2011) Genetic Algorithms as a Pre-Processing Strategy for Imbalanced Datasets. Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation, Dublin, 12-16 July 2011, 131-132.

[11]   García, S. and Herrera, F. (2009) Evolutionary Undersampling for Classification with Imbalanced Datasets: Proposals and Taxonomy. Evolutionary Computation, 17, 275-396.

[12]   Hilbert, M. and López, P. (2011) The World’s Technological Capacity to Store, Communicate, and Compute Information. Science, 332, 60-65.

[13]   Wu, X.D., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M., Hand, D.J. and Steinberg, D. (2007) Top 10 Algorithms in Data Mining. Knowledge Information Systems, 14, 1-37.

[14]   Fix, E. and Hodges, J.L. (1951) Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field.

[15]   Dasarathy, B.V. (1991) Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos.

[16]   Duda, R.O., Hart, P.E. and Stork, D.G. (2001) Pattern Classification. 2nd Edition, John Wiley & Sons Ltd., New York, 202-220.

[17]   Boriah, S., Chandola, V. and Kumar, V. (2007) Similarity Measures for Categorical Data: A Comparative Evaluation. Proceedings of the SIAM International Conference on Data Mining, Minneapolis, 26-28 April 2007, 243-254.

[18]   Wilson, D.R. and Martinez, T.R. (1997) Improved Heterogeneous Distance Functions. Journal of Artificial Intelligence Research, 6, 1-34.

[19]   Chawla, N.V., Lazarevic, A., Hall, L.O. and Bowyer, K.W. (2003) SMOTEBoost: Improving Prediction of the Minority Class in Boosting. Proceeding of 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, Cavtat-Dubrovnik, 22-26 September 2003, 107-119.

[20]   Chen, L., Cai, Z., Chen, L. and Gu, Q. (2010) A Novel Differential Evolution-Clustering Hybrid Resampling Algorithm on Imbalanced Datasets. 3rd International Conference on Knowledge Discovery and Data Mining, Phuket, 9-10 January 2010, 81-85.

[21]   Han, H., Wang, W.Y. and Mao, B.H. (2005) Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. Proceedings of International Conference on Intelligent Computing, Hefei, 23-26 August 2005, 878-887.

[22]   He, H., Bai, Y. and Garcia, E.A. (2008) ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. Proceedings of International Joint Conference on Neural Networks, Hong Kong, 1-8 June 2008, 1322-1328.

[23]   Batista, G.E.A.P.A., Prati, R.C. and Monard, M.C. (2004) A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data. ACM SIGKDD Explorations Newsletter, 6, 20-29.

[24]   Tomek, I. (1976) Two Modifications of CNN. IEEE Transactions on Systems Man and Communications, 6, 769-772.

[25]   Wang, B.X. and Japkowicz, N. (2004) Imbalanced Data Set Learning with Synthetic Samples. Proceedings of IRIS Machine Learning Workshop, Ottawa, 9 June 2004.

[26]   Wilson, D.R. and Martinez, T.R. (2000) Reduction Techniques for Instance-Based. Machine Learning, 38, 257-286.

[27]   Van Rijsbergen, C.J. (1979) Information Retrieval. 2nd Edition, Butterworths, Waltham.

[28]   Ian, H.W. and Frank, E. (2005) Data Mining: Practical Machine Learning Tools and Techniques. 2nd Edition, Morgan Kaufmann, San Francisco.

[29]   Zhang, J.P. and Mani, I. (2003) KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction. Proceeding of International Conference on Machine Learning (ICML 2003), Workshop on Learning from Imbalanced Data Sets, Washington DC, 21 August 2003.

[30]   Orriols-Puig, A. and Bernadó-Mansilla, E. (2009) Evolutionary Rule-Based Systems for Imbalanced Datasets. Soft Computing, 13, 213-225.

[31]   Blake, C. and Merz, C. (1998) UCI Repository of Machine Learning Databases. Department of Information and Computer Sciences, University of California, Oakland.

[32]   Kohavi, R. and Quinlan, J.R. (2002) Decision Tree Discovery. In: Klosgen, W. and Zytkow, J.M., Eds., Handbook of Data Mining and Knowledge Discovery, Oxford University Press, New York, 267-276