This contribution deals with a generative approach for the analysis of textual data.
Instead of creating heuristic rules for the representation
of documents and word counts, we employ a distribution able to model words
across texts while accounting for different topics. In this regard, following Minka's
proposal (2003), we implement a Dirichlet Compound Multinomial (DCM) distribution; we then propose an
extension called sbDCM that takes
explicitly into account the different latent topics that compose the document.
We follow two alternative approaches: on the one hand, the topics can be unknown
and thus estimated from the data; on the other hand, the topics are
determined in advance on the basis of a predefined ontological schema. The two
approaches are assessed on real data.
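As a point of orientation, the standard DCM (Dirichlet-multinomial) probability of a word-count vector can be sketched as below. This is a minimal illustration of the textbook form of the distribution, not the authors' implementation; the function name dcm_log_pmf and the toy parameters are assumptions made for the example, and the sbDCM extension is not shown.

```python
from math import lgamma, exp


def dcm_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under a DCM distribution.

    counts: non-negative integer word counts for one document.
    alpha:  positive Dirichlet parameters, one per vocabulary word.
    (Both names are illustrative, not taken from the paper.)
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("all alpha values must be positive")

    n = sum(counts)       # document length
    a_sum = sum(alpha)    # sum of the Dirichlet parameters

    # log of the multinomial coefficient n! / prod(x_w!)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # log of Gamma(a_sum) / Gamma(n + a_sum)
    log_norm = lgamma(a_sum) - lgamma(n + a_sum)
    # log of prod(Gamma(x_w + alpha_w) / Gamma(alpha_w))
    log_terms = sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    return log_coef + log_norm + log_terms


# With alpha = (1, 1) every split of a two-word document is equally likely,
# so each of the three possible count vectors has probability about 1/3.
p = exp(dcm_log_pmf([1, 1], [1.0, 1.0]))
```

Working in log space with lgamma avoids overflow for realistic document lengths. Small alpha values put more mass on count vectors in which a few words repeat, which is the burstiness behaviour discussed in the DCM literature cited below.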
Cite this paper
P. Cerchiello and P. Giudici, “Dirichlet Compound Multinomials Statistical Models,” Applied Mathematics, Vol. 3, No. 12, 2012, pp. 2089-2097. doi: 10.4236/am.2012.312A288
 S. Deerwester, S. Dumais, G. W. Furnas, T. K. Landauer and R. Harshman, “Indexing by Latent Semantic Analysis,” Journal of the American Society for Information Science, Vol. 41, No. 6, 1990, pp. 391-407.
 T. Hofmann, “Probabilistic Latent Semantic Indexing,” Proceedings of Special Interest Group on Information Retrieval, New York, 1999, pp. 50-57.
 D. M. Blei, A. Y. Ng and M. I. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, Vol. 3, 2003, pp. 993-1022.
 M. Girolami and A. Kaban, “On an Equivalence between PLSI and LDA,” Proceedings of Special Interest Group on Information Retrieval, New York, 2003, pp. 433-434.
 D. M. Blei and J. D. Lafferty, “Correlated Topic Models,” Advances in Neural Information Processing Systems, Vol. 18, 2006, pp. 1-47.
 D. Putthividhya, H. T. Attias and S. S. Nagarajan, “Independent Factor Topic Models,” Proceedings of International Conference on Machine Learning, New York, 2009, pp. 833-840.
 J. E. Mosimann, “On the Compound Multinomial Distribution, the Multivariate B-Distribution, and Correlations among Proportions,” Biometrika, Vol. 49, No. 1-2, 1962, pp. 65-82.
 K. Sjolander, K. Karplus, M. Brown, R. Hughey, A. Krogh, I. S. Mian and D. Haussler, “Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology,” Computer Applications in the Biosciences, Vol. 12, No. 4, 1996, pp. 327-345.
 D. J. C. MacKay and L. Peto, “A Hierarchical Dirichlet Language Model,” Natural Language Engineering, Vol. 1, No. 3, 1994, pp. 1-19.
 T. Minka, “Estimating a Dirichlet Distribution,” Unpublished Paper, 2003.
 R. E. Madsen, D. Kauchak and C. Elkan, “Modeling Word Burstiness Using the Dirichlet Distribution,” Proceedings of the 22nd International Conference on Machine Learning, New York, 2005, pp. 545-552.
 G. Doyle and C. Elkan, “Accounting for Burstiness in Topic Models,” Proceedings of International Conference on Machine Learning, New York, 2009, pp. 281-288.
 J. D. M. Rennie, L. Shih, J. Teevan and D. R. Karger, “Tackling the Poor Assumptions of Naive Bayes Text Classifiers,” Proceedings of the 20th International Conference on Machine Learning, Washington DC, 2003, 6 p.
 A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society, Series B, Vol. 39, No. 1, 1977, pp. 1-38.
 D. Böhning, “The EM Algorithm with Gradient Function Update for Discrete Mixtures with Known (Fixed) Number of Components,” Statistics and Computing, Vol. 13, No. 3, 2003, pp. 257-265. doi:10.1023/A:1024222817645
 S. Staab and R. Studer, “Handbook on Ontologies, International Handbooks on Information Systems,” 2nd Edition, Springer, Berlin, 2009.
 P. Cerchiello, “Statistical Models to Measure Corporate Reputation,” Journal of Applied Quantitative Methods, Vol. 6, No. 4, 2011, pp. 58-71.