JDAIP  Vol.3 No.4, November 2015
Dimensionality Reduction of Distributed Vector Word Representations and Emoticon Stemming for Sentiment Analysis
Abstract: Social media platforms such as Twitter and the Internet Movie Database (IMDb) contain vast amounts of data with applications in predictive sentiment analysis of movie sales, stock market fluctuations, brand opinion, and current events. Using a dataset taken from IMDb by Stanford, we identify some of the phrases most indicative of sentiment across a wide variety of movie reviews. Data from Twitter are especially attractive because of their real-time nature, available through Twitter's streaming API. Analyzing these data effectively in a streaming fashion requires efficient models, which may be improved by reducing the dimensionality of the input vectors. One way this has been done in the past is by using emoticons as features; we propose a method for further reducing these features by identifying common structure in emoticons with similar sentiment. We also examine the gender distribution of emoticon usage and find that the tendency to use certain emoticons differs disproportionately between males and females. Despite the roughly equal gender distribution on Twitter, emoticon usage is predominantly female. Furthermore, we find that distributed vector representations, such as those produced by Word2Vec, may be reduced through feature selection. This analysis was done on a manually labeled sample of 1000 tweets from a new dataset, the Large Emoticon Corpus, which consists of about 8.5 million tweets containing emoticons and was collected over a five-day period in May 2015. Additionally, using the common structure of similar emoticons, we are able to characterize positive and negative emoticons with two regular expressions that account for over 90% of emoticon usage in the Large Emoticon Corpus.
Cite this paper: Dickinson, B., Ganger, M. and Hu, W. (2015) Dimensionality Reduction of Distributed Vector Word Representations and Emoticon Stemming for Sentiment Analysis. Journal of Data Analysis and Information Processing, 3, 153-162. doi: 10.4236/jdaip.2015.34015.
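
As an illustration of the emoticon stemming idea described in the abstract, the following minimal sketch collapses emoticons that share a common structure (eyes, optional nose, mouth) into a single feature token. The two patterns, the token names EMO_POS and EMO_NEG, and the function stem_emoticons are hypothetical and chosen only for illustration; they are not the regular expressions reported in the paper and cover only a small family of emoticons.

```python
import re

# Hypothetical patterns (NOT the paper's): eyes, an optional "-" nose,
# then one or more mouth characters that determine the sentiment.
POSITIVE = re.compile(r"[:;=]-?[)\]D]+")   # e.g. :)  ;-)  =]  :D  :)))
NEGATIVE = re.compile(r"[:;=]-?[(\[]+")    # e.g. :(  :-(  =[  :((

def stem_emoticons(text):
    """Replace each matched emoticon with a single sentiment token."""
    text = POSITIVE.sub(" EMO_POS ", text)
    text = NEGATIVE.sub(" EMO_NEG ", text)
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())

if __name__ == "__main__":
    print(stem_emoticons("great movie :-) loved it :D"))
    print(stem_emoticons("so boring :(("))
```

Mapping many surface forms to two tokens in this way reduces the number of distinct emoticon features a downstream classifier has to handle, which is the kind of dimensionality reduction the abstract refers to.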
