JILSA Vol.8 No.4, November 2016
Improved Term Weighting Technique for Automatic Web Page Classification
Abstract: Automatic web page classification has become indispensable for web directories because of the multitude of pages on the World Wide Web. In this paper, an improved term weighting technique is proposed for automatic and effective classification of web pages. Web documents are represented as sets of features. The proposed method selects and extracts the most prominent features, reducing the high-dimensionality problem of the classifier. Proper selection of features from the large set improves the performance of the classifier. The proposed algorithm is implemented and tested on a benchmark dataset. The results show better performance than most existing term weighting techniques.
Cite this paper: Thangairulappan, K. and Kanagavel, A. (2016) Improved Term Weighting Technique for Automatic Web Page Classification. Journal of Intelligent Learning Systems and Applications, 8, 63-76. doi: 10.4236/jilsa.2016.84006.
