JDAIP Vol.2 No.4, November 2014
A Feature Subset Selection Technique for High Dimensional Data Using Symmetric Uncertainty
With the abundance of exceptionally high-dimensional data, feature selection has become an essential step in the data mining process. In this paper, we investigate the problem of efficient feature selection for classification on high-dimensional datasets. We present a novel filter-based approach that ranks the features by a score, and we then measure the performance of four data mining classification algorithms on the resulting data. In the proposed approach, we partition the sorted feature list and search for important features in both the forward and the reverse direction, starting simultaneously from the first and the last feature of the sorted list. The approach is highly scalable and effective because it parallelizes over both attributes and tuples simultaneously, which allows us to evaluate many potential features for high-dimensional datasets. Experiments on real and synthetic high-dimensional datasets show that the proposed framework improves the precision of the selected features. We also compare its classification accuracy against that of various other feature selection methods.
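The abstract does not spell out the scoring function, but the title names symmetric uncertainty (SU). As a minimal sketch, assuming the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)) for discrete variables, a score-based ranking of features against the class label could look like the following. The function names, the toy data, and the choice to rank by SU with the class label are illustrative assumptions, not the authors' implementation, and the two-ended search over the sorted list is not reproduced here.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """Standard SU: 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant, so nothing is shared
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mutual_info = h_x + h_y - h_xy    # I(X; Y)
    return 2.0 * mutual_info / (h_x + h_y)


def rank_features(columns, labels):
    """Sort feature names by decreasing SU with the class label.

    `columns` maps a (hypothetical) feature name to its list of values.
    """
    scores = {name: symmetric_uncertainty(col, labels)
              for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Toy data, invented purely for illustration.
labels = [0, 0, 1, 1]
columns = {
    "copy_of_label": [0, 0, 1, 1],  # fully informative about the label
    "constant": [5, 5, 5, 5],       # carries no information
    "unrelated": [0, 1, 0, 1],      # independent of the label
}
ranking, scores = rank_features(columns, labels)
```

In this sketch each feature is scored independently of the others, so the per-feature loop is the natural place to parallelize over attributes. Continuous features would need to be discretized before the entropy estimates are meaningful.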

Cite this paper
Singh, B., Kushwaha, N. and Vyas, O. (2014) A Feature Subset Selection Technique for High Dimensional Data Using Symmetric Uncertainty. Journal of Data Analysis and Information Processing, 2, 95-105. doi: 10.4236/jdaip.2014.24012.