Minimum Description Length Methods in Bayesian Model Selection: Some Applications

Author(s)
Mohan Delampady

ABSTRACT

The computations involved in the Bayesian approach to practical model
selection problems are usually very difficult. Computational
simplifications are sometimes possible, but they are not generally applicable. There
is a large literature on Minimum Description Length (MDL), a methodology based
on information theory. This paper describes how many MDL
techniques are either directly Bayesian in nature or very good objective
approximations to Bayesian solutions. First, connections between the Bayesian
approach and MDL are explored theoretically; thereafter, a few illustrations
are provided to show how MDL can give useful computational simplifications.
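
As a rough illustration of the kind of approximation the abstract refers to, the sketch below compares a two-part MDL code length with the exact negative log marginal likelihood for a toy problem. The toy problem is not taken from the paper: it is an assumed comparison of a fair-coin model against a Bernoulli model with unknown success probability under a uniform prior. The function names and the (1/2) log n parameter cost are illustrative choices, and all code lengths are in nats.

```python
import math


def fair_coin_code_length(n: int) -> float:
    """Code length of n binary outcomes under the fixed model p = 1/2."""
    return n * math.log(2.0)


def bayes_code_length(k: int, n: int) -> float:
    """Exact negative log marginal likelihood of k successes in n trials
    for a Bernoulli model with a uniform prior on p.
    The marginal likelihood is Beta(k + 1, n - k + 1) = k! (n - k)! / (n + 1)!.
    """
    log_marginal = (
        math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
    )
    return -log_marginal


def two_part_mdl_code_length(k: int, n: int) -> float:
    """Two-part code length: negative maximized log-likelihood plus
    (1/2) log n for the single free parameter."""
    def term(count: int) -> float:
        # Use the convention 0 * log(0) = 0 at the boundary.
        return count * math.log(count / n) if count > 0 else 0.0

    neg_max_log_lik = -(term(k) + term(n - k))
    return neg_max_log_lik + 0.5 * math.log(n)


if __name__ == "__main__":
    n = 100
    for k in (50, 70):
        print(
            f"k={k}: fair={fair_coin_code_length(n):.3f} "
            f"bayes={bayes_code_length(k, n):.3f} "
            f"mdl={two_part_mdl_code_length(k, n):.3f}"
        )
```

In this toy setting the two-part code length needs only the maximum likelihood estimate and a penalty term, whereas the Bayesian quantity requires an integral over the parameter. Here that integral happens to have a closed form, which is what makes the comparison easy to check.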

Cite this paper

M. Delampady, "Minimum Description Length Methods in Bayesian Model Selection: Some Applications," *Open Journal of Statistics*, Vol. 3 No. 2, 2013, pp. 103-117. doi: 10.4236/ojs.2013.32012.
