The Role of Rare Terms in Enhancing the Performance of Polynomial Networks Based Text Categorization

Author(s)
Mayy M. Al-Tahrawi

Affiliation(s)

Department of Computer Information Systems, Faculty of Information Technology, Al-Ahliyya Amman University, Amman, Jordan.


ABSTRACT

This paper investigates the role of rare (infrequent) terms in the accuracy of English Text Categorization using Polynomial Networks (PNs). Several term reduction criteria and several term weighting schemes were tested on the Reuters Corpus using PNs. Each term weighting scheme was run on each reduced term set twice: once keeping the rare terms and once with them removed. In all the experiments conducted in this research, keeping rare terms substantially improved the performance of Polynomial Networks in Text Categorization, regardless of the term reduction method, the number of terms used in classification, or the term weighting scheme adopted.
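The keep-versus-remove comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not state how "rare" is defined, which weighting schemes were used, or how the PN classifier was trained, so everything below is an assumption made for illustration: a term is treated as rare when its document frequency is below a hypothetical `min_df` threshold, binary weighting stands in for the weighting schemes, and the three toy documents are invented. The function names `document_frequency`, `build_vocabulary`, and `binary_vectors` are likewise hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def build_vocabulary(docs, keep_rare=True, min_df=2):
    """Return the sorted term set.

    When keep_rare is False, terms with document frequency below min_df
    are dropped. The min_df threshold is an assumption, not the paper's
    definition of a rare term.
    """
    df = document_frequency(docs)
    if keep_rare:
        return sorted(df)
    return sorted(term for term, count in df.items() if count >= min_df)


def binary_vectors(docs, vocab):
    """Represent each document as a 0/1 vector over vocab (binary weighting)."""
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for tokens in docs:
        row = [0] * len(vocab)
        for term in tokens:
            if term in index:
                row[index[term]] = 1
        vectors.append(row)
    return vectors


# Invented toy corpus of already tokenized documents.
docs = [
    ["oil", "price", "rise"],
    ["oil", "export", "opec"],
    ["wheat", "price", "export"],
]

vocab_with_rare = build_vocabulary(docs, keep_rare=True)
vocab_without_rare = build_vocabulary(docs, keep_rare=False, min_df=2)

print(len(vocab_with_rare), vocab_with_rare)
print(len(vocab_without_rare), vocab_without_rare)
print(binary_vectors(docs, vocab_without_rare))
```

In an experiment of the kind the abstract describes, each of the two vocabularies would feed the same classifier and the resulting accuracies would be compared; the sketch stops at the feature vectors because the classifier details are not given here.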


KEYWORDS

Polynomial Networks; Text Categorization; Document Classification; Infrequent Terms; Rare Terms


Cite this paper

M. Al-Tahrawi, "The Role of Rare Terms in Enhancing the Performance of Polynomial Networks Based Text Categorization," *Journal of Intelligent Learning Systems and Applications*, Vol. 5 No. 2, 2013, pp. 84-89. doi: 10.4236/jilsa.2013.52009.

