JDAIP Vol. 3 No. 3, August 2015
An Improved Algorithm for Imbalanced Data and Small Sample Size Classification
Abstract: Traditional classification algorithms perform poorly on imbalanced data sets and small sample sizes. To address this problem, a novel method is proposed that changes the class distribution by adding virtual samples generated with the windowed regression over-sampling (WRO) method. WRO reflects not only the additive effects but also the multiplicative effects between samples. The proposed method is compared with other over-sampling methods, such as the synthetic minority over-sampling technique (SMOTE) and borderline over-sampling (BOS), on UCI datasets and a Fourier transform infrared spectroscopy (FTIR) data set. Experimental results show that the WRO method can achieve better performance than the other methods.
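The idea of rebalancing a class by adding virtual samples can be illustrated with the SMOTE baseline named in the abstract. The following is a minimal sketch of SMOTE-style interpolation only, not of the WRO method, whose details are not given here. The function name `smote_sketch`, its parameters, and the toy data are hypothetical and chosen for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new virtual points by SMOTE-style interpolation.

    Hypothetical helper for illustration; this is not the WRO method.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Toy minority class with four 2-D samples (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
virtual = smote_sketch(minority, n_new=6)
balanced_minority = minority + virtual
```

Each virtual sample lies on a line segment between two existing minority samples, so this baseline only combines samples additively. The abstract states that WRO additionally reflects multiplicative effects between samples.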
Cite this paper: Hu, Y., Guo, D., Fan, Z., Dong, C., Huang, Q., Xie, S., Liu, G., Tan, J., Li, B. and Xie, Q. (2015) An Improved Algorithm for Imbalanced Data and Small Sample Size Classification. Journal of Data Analysis and Information Processing, 3, 27-33. doi: 10.4236/jdaip.2015.33004.
