Dirichlet Compound Multinomials Statistical Models


This contribution deals with a generative approach to the analysis of textual data.
Instead of creating heuristic rules for the representation of documents and word
counts, we employ a distribution able to model word occurrences across texts while
accounting for different topics. Following Minka's proposal (2003), we implement a
Dirichlet Compound Multinomial (*DCM*) distribution and then propose an extension,
called *sbDCM*, that explicitly takes into account the different latent topics that
compose a document. We follow two alternative approaches: in the first, the topics
are unknown and must be estimated from the data; in the second, the topics are
determined in advance on the basis of a predefined ontological schema. Both
approaches are assessed on real data.
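
The abstract does not state the DCM formula itself, so the following is only a minimal sketch, assuming the standard form of the DCM (Polya) distribution as used, for example, by Madsen et al. [11]: for a count vector x with total n and parameters α with sum A, p(x | α) = n! / ∏ x_w! · Γ(A) / Γ(n + A) · ∏ Γ(x_w + α_w) / Γ(α_w). The function name `dcm_log_pmf` and its arguments are illustrative and not taken from the paper, and the sketch does not cover the *sbDCM* extension.

```python
from math import exp, lgamma


def dcm_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under a DCM (Polya) distribution.

    counts: non-negative integer word counts, one per vocabulary term.
    alpha:  positive Dirichlet parameters, one per vocabulary term.
    Both names are hypothetical; they do not come from the paper.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("all alpha values must be positive")

    n = sum(counts)      # document length
    a_sum = sum(alpha)   # sum of the Dirichlet parameters

    # log of the multinomial coefficient n! / prod(x_w!)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # log of Gamma(A) / Gamma(n + A)
    log_norm = lgamma(a_sum) - lgamma(n + a_sum)
    # log of prod(Gamma(x_w + alpha_w) / Gamma(alpha_w))
    log_terms = sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))

    return log_coef + log_norm + log_terms


# Two-word vocabulary with a uniform prior (alpha = 1, 1): every split of
# n = 2 tokens is equally likely under this assumed form.
p_split = exp(dcm_log_pmf([1, 1], [1.0, 1.0]))
```

Working in log space with `lgamma` avoids overflow in the Gamma functions for long documents; exponentiating is only sensible for small examples such as the one above.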

References

[1] S. Deerwester, S. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, “Indexing by Latent Semantic Analysis,” Journal of the American Society for Information Science, Vol. 41, No. 6, 1990, pp. 391-407.
doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9

[2] T. Hofmann, “Probabilistic Latent Semantic Indexing,” Proceedings of Special Interest Group on Information Retrieval, New York, 1999, pp. 50-57.

[3] D. M. Blei, A. Y. Ng and M. I. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, Vol. 3, 2003, pp. 993-1022.

[4] M. Girolami and A. Kaban, “On an Equivalence between PLSI and LDA,” Proceedings of Special Interest Group on Information Retrieval, New York, 2003, pp. 433-434.

[5] D. M. Blei and J. D. Lafferty, “Correlated Topic Models,” Advances in Neural Information Processing Systems, Vol. 18, 2006, pp. 1-47.

[6] D. Putthividhya, H. T. Attias and S. S. Nagarajan, “Independent Factor Topic Models,” Proceedings of International Conference on Machine Learning, New York, 2009, pp. 833-840.

[7] J. E. Mosimann, “On the Compound Multinomial Distribution, the Multivariate B-Distribution, and Correlations among Proportions,” Biometrika, Vol. 49, No. 1-2, 1962, pp. 65-82.

[8] K. Sjölander, K. Karplus, M. Brown, R. Hughey, A. Krogh, I. S. Mian and D. Haussler, “Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology,” Computer Applications in the Biosciences, Vol. 12, No. 4, 1996, pp. 327-345.

[9] D. J. C. Mackay and L. Peto, “A Hierarchical Dirichlet Language Model,” Natural Language Engineering, Vol. 1, No. 3, 1994, pp. 1-19.

[10] T. Minka, “Estimating a Dirichlet Distribution,” Unpublished Paper, 2003.
http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/

[11] R. E. Madsen, D. Kauchak and C. Elkan, “Modeling Word Burstiness Using the Dirichlet Distribution,” Proceedings of the 22nd International Conference on Machine Learning, New York, 2005, pp. 545-552.

[12] G. Doyle and C. Elkan, “Accounting for Burstiness in Topic Models,” Proceedings of International Conference on Machine Learning, New York, 2009, pp. 281-288.

[13] J. D. M. Rennie, L. Shih, J. Teevan and D. R. Karger, “Tackling the Poor Assumptions of Naive Bayes Text Classifiers,” Proceedings of the 20th International Conference on Machine Learning, Washington DC, 2003, 6 p.

[14] A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, Series B, Vol. 39, No. 1, 1977, pp. 1-38.

[15] D. Böhning, “The EM Algorithm with Gradient Function Update for Discrete Mixtures with Known (Fixed) Number of Components,” Statistics and Computing, Vol. 13, No. 3, 2003, pp. 257-265. doi:10.1023/A:1024222817645

[16] S. Staab and R. Studer, “Handbook on Ontologies, International Handbooks on Information Systems,” 2nd Edition, Springer, Berlin, 2009.

[17] P. Cerchiello, “Statistical Models to Measure Corporate Reputation,” Journal of Applied Quantitative Methods, Vol. 6, No. 4, 2011, pp. 58-71.