Cross-Validation, Shrinkage and Variable Selection in Linear Regression Revisited

Affiliation(s)

Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands.

Institut fuer Medizinische Biometrie und Medizinische Informatik, Universitaetsklinikum Freiburg, Freiburg, Germany.

ABSTRACT

In deriving a regression model, analysts often have to use variable
selection despite the problems introduced by data-dependent model
building. Resampling approaches have been proposed to handle some of the critical
issues. To assess and compare several strategies, we conduct a
simulation study with 15 predictors and a complex correlation structure in
the linear regression model. Using sample sizes of 100 and 400 and estimates of
the residual variance corresponding to *R*^{2} of 0.50 and 0.71, we consider 4 scenarios with varying amounts of information.
We also consider two examples with 24 and 13 predictors, respectively. We
discuss the value of cross-validation, shrinkage and backward
elimination (BE) with varying significance levels. We assess whether 2-step
approaches using global shrinkage or parameterwise shrinkage factors (PWSF) can improve selected models, and we compare the results to
models derived with the LASSO procedure. Besides MSE, we use model
sparsity and further criteria for model assessment. The amount of information
in the data influences both the selected models and the comparison of the
procedures. None of the approaches was best in all scenarios. The
performance of backward elimination with a suitably chosen significance level
was no worse than that of the LASSO, and the models selected by BE were much sparser,
an important advantage for interpretation and transportability. PWSF
performed better than global shrinkage. Provided that the amount of
information is not too small, we conclude that BE followed by PWSF is a suitable
approach when variable selection is a key part of data analysis.
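
As an illustration of the global shrinkage idea mentioned above, the following is a minimal sketch rather than the procedure used in the paper. It assumes one common estimator: the global shrinkage factor is taken as the slope obtained when the outcome is regressed on cross-validated predictions. The function names (`solve`, `ols`, `cv_global_shrinkage`, `simulate`), the 10-fold split and the simulated data with five independent predictors are all hypothetical choices made for this example; variable selection by BE is not included.

```python
import random


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    Z = [[1.0] + row for row in X]
    k = len(Z[0])
    XtX = [[sum(z[a] * z[b] for z in Z) for b in range(k)] for a in range(k)]
    Xty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(k)]
    return solve(XtX, Xty)


def cv_global_shrinkage(X, y, folds=10):
    """Assumed estimator: slope of y regressed on cross-validated predictions."""
    n = len(y)
    cv_pred = [0.0] * n
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        beta = ols([X[i] for i in train], [y[i] for i in train])
        for i in range(f, n, folds):  # observations left out in this fold
            cv_pred[i] = beta[0] + sum(b * x for b, x in zip(beta[1:], X[i]))
    _, slope = ols([[p] for p in cv_pred], y)
    return slope


def simulate(n, seed=0):
    """Hypothetical data: 5 independent predictors, only the first 2 have an effect."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(n)]
    y = [1.0 + 0.5 * row[0] - 0.5 * row[1] + rng.gauss(0, 1) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = simulate(100, seed=1)
    c = cv_global_shrinkage(X, y)
    print("estimated global shrinkage factor:", round(c, 3))
```

Under this assumption, all slope estimates of the fitted model would be multiplied by the returned factor and the intercept re-estimated. A parameterwise variant would instead regress the outcome on the separate cross-validated contributions of each predictor, giving one factor per coefficient.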

Cite this paper

H. Houwelingen and W. Sauerbrei, "Cross-Validation, Shrinkage and Variable Selection in Linear Regression Revisited," *Open Journal of Statistics*, Vol. 3 No. 2, 2013, pp. 79-102. doi: 10.4236/ojs.2013.32011.

References

[1] C. Chen and S. L. George, “The Bootstrap and Identification of Prognostic Factors via Cox’s Proportional Hazards Regression Model,” Statistics in Medicine, Vol. 4, No. 1, 1985, pp. 39-46. doi:10.1002/sim.4780040107

[2] J. C. van Houwelingen and S. le Cessie, “Predictive Value of Statistical Models,” Statistics in Medicine, Vol. 9, No. 11, 1990, pp. 1303-1325. doi:10.1002/sim.4780091109

[3] F. E. Harrell, K. L. Lee and D. B. Mark, “Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors,” Statistics in Medicine, Vol. 15, No. 4, 1996, pp. 361-387. doi:10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4

[4] W. Sauerbrei, “The Use of Resampling Methods to Simplify Regression Models in Medical Statistics,” Journal of the Royal Statistical Society: Series C (Applied Statistics), Vol. 48, No. 3, 1999, pp. 313-329. doi:10.1111/1467-9876.00155

[5] R. Tibshirani, “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society, Series B, Vol. 58, No. 1, 1996, pp. 267-288.

[6] W. Sauerbrei, P. Royston and H. Binder, “Selection of Important Variables and Determination of Functional Form for Continuous Predictors in Multivariable Model Building,” Statistics in Medicine, Vol. 26, No. 30, 2007, pp. 5512-5528. doi:10.1002/sim.3148

[7] N. Mantel, “Why Stepdown Procedures in Variable Selection?” Technometrics, Vol. 12, No. 3, 1970, pp. 621-625. doi:10.1080/00401706.1970.10488701

[8] P. Royston and W. Sauerbrei, “Multivariable Model-Building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables,” Wiley, Chichester, 2008. doi:10.1002/9780470770771

[9] J. C. van Houwelingen, “Shrinkage and Penalized Likelihood as Methods to Improve Predictive Accuracy,” Statistica Neerlandica, Vol. 55, No. 1, 2001, pp. 17-34. doi:10.1111/1467-9574.00154

[10] W. Sauerbrei, N. Holländer and A. Buchholz, “Investigation about a Screening Step in Model Selection,” Statistics and Computing, Vol. 18, No. 2, 2008, pp. 195-208. doi:10.1007/s11222-007-9048-5

[11] J. B. Copas, “Regression, Prediction and Shrinkage (with Discussion),” Journal of the Royal Statistical Society Series B-Methodological, Vol. 45, No. 3, 1983, pp. 311-354.

[12] L. Breiman, “Better Subset Regression Using the Nonnegative Garrote,” Technometrics, Vol. 37, No. 4, 1995, pp. 373-384. doi:10.1080/00401706.1995.10484371

[13] K. Vach, W. Sauerbrei and M. Schumacher, “Variable Selection and Shrinkage: Comparison of Some Approaches,” Statistica Neerlandica, Vol. 55, No. 1, 2001, pp. 53-75. doi:10.1111/1467-9574.00156

[14] J. C. Wyatt and D. G. Altman, “Prognostic Models: Clinically Useful or Quickly Forgotten?” British Medical Journal, Vol. 311, No. 7019, 1995, pp. 1539-1541. doi:10.1136/bmj.311.7019.1539

[15] S. Varma and R. Simon, “Bias in Error Estimation When Using Cross-Validation for Model Selection,” BMC Bioinformatics, Vol. 7, No. 91, 2006. doi:10.1186/1471-2105-7-91

[16] M. Schumacher, N. Holländer and W. Sauerbrei, “Resampling and Cross-Validation Techniques: A Tool to Reduce Bias Caused by Model Building?” Statistics in Medicine, Vol. 16, No. 24, 1997, pp. 2813-2827. doi:10.1002/(SICI)1097-0258(19971230)16:24<2813::AID-SIM701>3.0.CO;2-Z

[17] G. Ihorst, T. Frischer, F. Horak, M. Schumacher, M. Kopp, J. Forster, J. Mattes and J. Kuehr, “Long and Medium-Term Ozone Effects on Lung Growth Including a Broad Spectrum of Exposure,” European Respiratory Journal, Vol. 23, No. 2, 2004, pp. 292-299. doi:10.1183/09031936.04.00021704

[18] A. Buchholz, N. Holländer and W. Sauerbrei, “On Properties of Predictors Derived with a Two-Step Bootstrap Model Averaging Approach: A Simulation Study in the Linear Regression Model,” Computational Statistics and Data Analysis, Vol. 52, No. 5, 2008, pp. 2778-2793. doi:10.1016/j.csda.2007.10.007

[19] R. W. Johnson, “Fitting Percentage of Body Fat to Simple Body Measurements,” Journal of Statistics Education, Vol. 4, No. 1, 1996.

[20] F. E. Harrell, “Regression Modeling Strategies, with Applications to Linear Models, Logistic Regression and Survival Analysis,” Springer, New York, 2001.

[21] E. Steyerberg, R. Eijkemans, F. Harrell and J. Habbema, “Prognostic Modelling with Logistic Regression Analysis: A Comparison of Selection and Estimation Methods in Small Data Sets,” Statistics in Medicine, Vol. 19, No. 8, 2000, pp. 1059-1079. doi:10.1002/(SICI)1097-0258(20000430)19:8<1059::AID-SIM412>3.0.CO;2-0

[22] J. Bien, J. Taylor and R. Tibshirani, “A Lasso for Hierarchical Interactions,” Submitted 2012.

[23] F. E. Harrell, K. L. Lee, R. M. Califf, D. B. Pryor and R. A. Rosati, “Regression Modeling Strategies for Improved Prognostic Prediction,” Statistics in Medicine, Vol. 3, No. 2, 1984, pp. 143-152. doi:10.1002/sim.4780030207

[24] J. Q. Fan and R. Z. Li, “Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties,” Journal of the American Statistical Association, Vol. 96, No. 456, 2001, pp. 1348-1360. doi:10.1198/016214501753382273

[25] H. Zou and T. Hastie, “Regularization and Variable Selection via the Elastic Net,” Journal of the Royal Statistical Society Series B, Vol. 67, No. 2, 2005, pp. 301-320. doi:10.1111/j.1467-9868.2005.00503.x

[26] C. Porzelius, M. Schumacher and H. Binder, “Sparse Regression Techniques in Low-Dimensional Survival Data Settings,” Statistics and Computing, Vol. 20, No. 2, 2010, pp. 151-163. doi:10.1007/s11222-009-9155-6

[27] C. L. Leng, Y. Lin and G. Wahba, “A Note on the Lasso and Related Procedures in Model Selection,” Statistica Sinica, Vol. 16, 2006, pp. 1273-1284.
