Statistical Methods of SNP Data Analysis and Applications

Author(s)
Alexander Bulinski,
Oleg Butkovsky,
Victor Sadovnichy,
Alexey Shashkin,
Pavel Yaskov,
Alexander Balatskiy,
Larisa Samokhodskaya,
Vsevolod Tkachuk

ABSTRACT

We develop various statistical methods important for multidimensional genetic data analysis. Theorems justifying application of these methods are established. We concentrate on the multifactor dimensionality reduction, logic regression, random forests, stochastic gradient boosting along with their new modifications. We use complementary approaches to study the risk of complex diseases such as cardiovascular ones. The roles of certain combinations of single nucleotide polymorphisms and non-genetic risk factors are examined. To perform the data analysis concerning the coronary heart disease and myocardial infarction the Lomonosov Moscow State University supercomputer “Chebyshev” was employed.

We develop various statistical methods important for multidimensional genetic data analysis. Theorems justifying application of these methods are established. We concentrate on the multifactor dimensionality reduction, logic regression, random forests, stochastic gradient boosting along with their new modifications. We use complementary approaches to study the risk of complex diseases such as cardiovascular ones. The roles of certain combinations of single nucleotide polymorphisms and non-genetic risk factors are examined. To perform the data analysis concerning the coronary heart disease and myocardial infarction the Lomonosov Moscow State University supercomputer “Chebyshev” was employed.

Cite this paper

A. Bulinski, O. Butkovsky, V. Sadovnichy, A. Shashkin, P. Yaskov, A. Balatskiy, L. Samokhodskaya and V. Tkachuk, "Statistical Methods of SNP Data Analysis and Applications,"*Open Journal of Statistics*, Vol. 2 No. 1, 2012, pp. 73-87. doi: 10.4236/ojs.2012.21008.

A. Bulinski, O. Butkovsky, V. Sadovnichy, A. Shashkin, P. Yaskov, A. Balatskiy, L. Samokhodskaya and V. Tkachuk, "Statistical Methods of SNP Data Analysis and Applications,"

References

[1] Y. Fujikoshi, R. Shimizu and V. V. Ulyanov, “Multivariate Statistics: High-Dimensional and Large-Sample Approximations,” Wiley, Hoboken, 2010.

[2] S. Szymczak, J. Biernacka, H. Cordell, O. González- Recio, I. K?nig, H. Zhang and Y. Sun, “Machine Learning in Genome-Wide Association Studies,” Genetic Epidemiology, Vol. 33, No. S1, 2009, pp. 51-57. doi:10.1002/gepi.20473

[3] D. Brinza, M. Schultz, G. Tesler and V. Bafna, “RAPID Detection of Gene-Gene Interaction in Genome-Wide Association Studies”, Bioinformatics, Vol. 26, No. 22, 2010, pp. 2856-2862. doi:10.1093/bioinformatics/btq529

[4] K. Wang, S. P. Dickson, C. A. Stolle, I. D. Krantz, D. B. Goldstein and H. Hakonarson, “Interpretation of Association Signals and Identification of Causal Variants from Genome-Wide Association Studies,” The American Journal of Human Genetics, Vol. 86, No. 5, 2010, pp. 730-742. doi:10.1016/j.ajhg.2010.04.003

[5] Y. Liang and A. Kelemen. “Statistical Advances and Challenges for Analyzing Correlated High Dimensional SNP Data in Genomic Study for Complex Diseases,” Statistics Surveys, Vol. 2, No. 1, 2008, pp. 43-60. doi: 10.1214/07-SS026

[6] H. Schwender and I. Ruczinski, “Testing SNPs and Sets of SNPs for Importance in Association Studies,” Bio- statistics, Vol. 12, No. 1, 2011, pp. 18-32. doi: 10.1093/biostatistics/kxq042

[7] M. Ritchie, L. Hahn, N. Roodi, R. Bailey, W. Dupont, F. Parl and J. Moore, “Multifactor-Dimensionality Red- uction Reveals High-Order Interactions Among Estrogen- Metabolism Genes in Sporadic Breast Cancer,” The American Journal of Human Genetics, Vol. 69, No. 1, 2001, pp. 138-147. doi:10.1086/321276

[8] D. Velez, B. White, A. Motsinger, W. Bush, M. Ritchie, S. Williams and J. Moore, “A Balanced Accuracy Fun- ction for Epistasis Modeling in Imbalanced Datasets Using Multifactor Dimensionality Reduction,” Genetic Epidemiology, Vol. 31, No. 4, 2007, pp. 306-315. doi: 10.1002/gepi.20211

[9] I. Ruczinski, C. Kooperberg and M. LeBlanc, “Logic Regression,” Journal of Computational and Graphical Statiststics, Vol. 12, No. 3, 2003, pp. 475-511. doi: 10.1198/1061860032238

[10] H. Schwender and K. Ickstadt, “Identification of SNP Interactions Using Logic Regression,” Biostatistics, Vol. 9, No. 1, 2008, pp. 187-198. doi: 10.1093/biostatistics/kxm024

[11] L. Breiman, “Random Forests,” Machine Learning, Vol. 45, No. 1, 2001, pp. 5-32. doi:10.1023/A:1010933404324

[12] J. Friedman, “Stochastic Gradient Boosting,” Computational Statistics & Data analysis, Vol. 38, No. 4, 2002, pp. 367-378.

[13] X. Wan, C. Yang, Q. Yang, H. Xue, N. Tang and W. Yu, “Mega SNP Hunter: A Learning Approach to Detect Disease Predisposition SNPs and High Level Interactions in Genome Wide Association Study,” BMC Bioinformatics, Vol. 10, 2009, p. 13. doi:10.1186/1471-2105-10-13

[14] A. Bulinski, O. Butkovsky, A. Shashkin, P. Yaskov, M. Atroshchenko and A. Khaplanov, “Statistical Methods of SNPs Analysis,” Technical Report, 2010, pp. 1-159 (in Russian).

[15] G. Bradley-Smith, S. Hope, H. V. Firth and J. A. Hurst, “Oxford Handbook of Genetics,” Oxford University Press, New York, 2010.

[16] S. Winham, A. Slater and A. Motsinger-Reif, “A Comparison of Internal Validation Techniques for Multi- factor Dimensionality Reduction,” BMC Bioinformatics, Vol. 11, 2010, p. 394. doi:10.1186/1471-2105-11-394

[17] A. Arlot and A. Celisse, “A Survey of Cross-validation Procedures for Model Selection,” Statistics Surveys, Vol. 4, No. 1, 2010, pp. 40-79. doi:10.1214/09-SS054

[18] T. Hastie, R. Tibshirani and J. Friedman, “The Elements of Statistical Learning: Data Mining, Inference, and Prediction,” 2nd Edition, Springer, New York, 2009.

[19] R. L. Taylor and T.-C. Hu, “Strong Laws of Large Numbers for Arrays of Rowwise Independent Random Elements,” International Journal of Mathematics and Mathematical Sciences, Vol. 10, No. 4, 1987, pp. 805-814.

[20] E. Lehmann and J. Romano, “Testing Statistical Hypo- theses,” Springer, New York, 2005.

[21] P. Golland, F. Liang, S. Mukherjee and D. Panchenko, “Permutation Tests for Classification,” Lecture Notes in Computer Science, Vol. 3559, 2005, pp. 501-515.

[22] J. Park, “Independent Rule in Classification of Multi- variate Binary Data,” Journal of Multivariate Analysis, Vol. 100, No. 10, 2009, pp. 2270-2286. doi:10.1016/j.jmva.2009.05.004

[23] S. Lee, Y. Chung, R. Elston, Y. Kim, and T. Park, “Log- Linear Model-Based Multifactor Dimensionality Reduction Method to Detect Gene-gene Interactions,” Bioinformatics, Vol. 23, No. 19, 2007, pp. 2589-2595. doi:10.1093/bioinformatics/btm396

[24] A. Nikolaev and S. Jacobson, “Simulated Annealing,” In: M. Gendreau and J.-Y. Potvin, Eds., Handbook of Metaheuristics, Springer, New York, 2010, pp. 1-39.

[25] G. Biau, “Analysis of a Random Forests Model,” LSTA, LPMA, Paris, 2010.

[26] N. Chawla, “Data Mining for Imbalanced Datasets: An Overview,” In: O. Maimon and L. Rokach, Eds., Data Mining and Knowledge Discovery Handbook, Springer, New York, 2010, pp. 875-886.

[27] C. Strobl, A. Boulesteix, T. Kneib, T. Augustin and A. Zeileis, “Conditional Variable Importance for Random Forests,” BMC Bioinformatics, Vol. 9, 2008, p. 307. doi:10.1186/1471-2105-8-25

[28] A. Hirashiki, Y. Yamada, Y. Murase, Y. Suzuki, H. Kataoka, Y. Morimoto, T. Tajika, T. Murohara and M. Yokota, “Association of Gene Polymorphisms with Coronary Artery Disease in Low- or High-Risk Subjects Defined by Conventional Risk Factors,” Journal of the American College of Cardiology, Vol. 42, No. 8, 2003, pp. 1429-1437. doi:10.1016/S0735-1097(03)01062-3

[29] A. Balatskiy, E. Andreenko and L. Samokhodskaya, “The Connexin37 Polymorphism as a New Risk Factor of MI Development,” Siberian Medical Journal, Vol. 25, No. 2, 2010, pp. 64-65 (in Russian).

[30] C. Coffey, P. Hebert, M. Ritchie, H. Krumholz, J. Gaziano, P. Ridker, N. Brown, D. Vaughan and J. Moore, “An Application of Conditional Logistic Regression and Multifactor Dimensionality Reduction for Detecting Gene- Gene Interactions on Risk of Myocardial Infarction: The Importance of Model Validation,” BMC Bioinformatics, Vol. 5, 2004, p. 49. doi:10.1186/1471-2105-5-49

[1] Y. Fujikoshi, R. Shimizu and V. V. Ulyanov, “Multivariate Statistics: High-Dimensional and Large-Sample Approximations,” Wiley, Hoboken, 2010.

[2] S. Szymczak, J. Biernacka, H. Cordell, O. González- Recio, I. K?nig, H. Zhang and Y. Sun, “Machine Learning in Genome-Wide Association Studies,” Genetic Epidemiology, Vol. 33, No. S1, 2009, pp. 51-57. doi:10.1002/gepi.20473

[3] D. Brinza, M. Schultz, G. Tesler and V. Bafna, “RAPID Detection of Gene-Gene Interaction in Genome-Wide Association Studies”, Bioinformatics, Vol. 26, No. 22, 2010, pp. 2856-2862. doi:10.1093/bioinformatics/btq529

[4] K. Wang, S. P. Dickson, C. A. Stolle, I. D. Krantz, D. B. Goldstein and H. Hakonarson, “Interpretation of Association Signals and Identification of Causal Variants from Genome-Wide Association Studies,” The American Journal of Human Genetics, Vol. 86, No. 5, 2010, pp. 730-742. doi:10.1016/j.ajhg.2010.04.003

[5] Y. Liang and A. Kelemen. “Statistical Advances and Challenges for Analyzing Correlated High Dimensional SNP Data in Genomic Study for Complex Diseases,” Statistics Surveys, Vol. 2, No. 1, 2008, pp. 43-60. doi: 10.1214/07-SS026

[6] H. Schwender and I. Ruczinski, “Testing SNPs and Sets of SNPs for Importance in Association Studies,” Bio- statistics, Vol. 12, No. 1, 2011, pp. 18-32. doi: 10.1093/biostatistics/kxq042

[7] M. Ritchie, L. Hahn, N. Roodi, R. Bailey, W. Dupont, F. Parl and J. Moore, “Multifactor-Dimensionality Red- uction Reveals High-Order Interactions Among Estrogen- Metabolism Genes in Sporadic Breast Cancer,” The American Journal of Human Genetics, Vol. 69, No. 1, 2001, pp. 138-147. doi:10.1086/321276

[8] D. Velez, B. White, A. Motsinger, W. Bush, M. Ritchie, S. Williams and J. Moore, “A Balanced Accuracy Fun- ction for Epistasis Modeling in Imbalanced Datasets Using Multifactor Dimensionality Reduction,” Genetic Epidemiology, Vol. 31, No. 4, 2007, pp. 306-315. doi: 10.1002/gepi.20211

[9] I. Ruczinski, C. Kooperberg and M. LeBlanc, “Logic Regression,” Journal of Computational and Graphical Statiststics, Vol. 12, No. 3, 2003, pp. 475-511. doi: 10.1198/1061860032238

[10] H. Schwender and K. Ickstadt, “Identification of SNP Interactions Using Logic Regression,” Biostatistics, Vol. 9, No. 1, 2008, pp. 187-198. doi: 10.1093/biostatistics/kxm024

[11] L. Breiman, “Random Forests,” Machine Learning, Vol. 45, No. 1, 2001, pp. 5-32. doi:10.1023/A:1010933404324

[12] J. Friedman, “Stochastic Gradient Boosting,” Computational Statistics & Data analysis, Vol. 38, No. 4, 2002, pp. 367-378.

[13] X. Wan, C. Yang, Q. Yang, H. Xue, N. Tang and W. Yu, “Mega SNP Hunter: A Learning Approach to Detect Disease Predisposition SNPs and High Level Interactions in Genome Wide Association Study,” BMC Bioinformatics, Vol. 10, 2009, p. 13. doi:10.1186/1471-2105-10-13

[14] A. Bulinski, O. Butkovsky, A. Shashkin, P. Yaskov, M. Atroshchenko and A. Khaplanov, “Statistical Methods of SNPs Analysis,” Technical Report, 2010, pp. 1-159 (in Russian).

[15] G. Bradley-Smith, S. Hope, H. V. Firth and J. A. Hurst, “Oxford Handbook of Genetics,” Oxford University Press, New York, 2010.

[16] S. Winham, A. Slater and A. Motsinger-Reif, “A Comparison of Internal Validation Techniques for Multi- factor Dimensionality Reduction,” BMC Bioinformatics, Vol. 11, 2010, p. 394. doi:10.1186/1471-2105-11-394

[17] A. Arlot and A. Celisse, “A Survey of Cross-validation Procedures for Model Selection,” Statistics Surveys, Vol. 4, No. 1, 2010, pp. 40-79. doi:10.1214/09-SS054

[18] T. Hastie, R. Tibshirani and J. Friedman, “The Elements of Statistical Learning: Data Mining, Inference, and Prediction,” 2nd Edition, Springer, New York, 2009.

[19] R. L. Taylor and T.-C. Hu, “Strong Laws of Large Numbers for Arrays of Rowwise Independent Random Elements,” International Journal of Mathematics and Mathematical Sciences, Vol. 10, No. 4, 1987, pp. 805-814.

[20] E. Lehmann and J. Romano, “Testing Statistical Hypo- theses,” Springer, New York, 2005.

[21] P. Golland, F. Liang, S. Mukherjee and D. Panchenko, “Permutation Tests for Classification,” Lecture Notes in Computer Science, Vol. 3559, 2005, pp. 501-515.

[22] J. Park, “Independent Rule in Classification of Multi- variate Binary Data,” Journal of Multivariate Analysis, Vol. 100, No. 10, 2009, pp. 2270-2286. doi:10.1016/j.jmva.2009.05.004

[23] S. Lee, Y. Chung, R. Elston, Y. Kim, and T. Park, “Log- Linear Model-Based Multifactor Dimensionality Reduction Method to Detect Gene-gene Interactions,” Bioinformatics, Vol. 23, No. 19, 2007, pp. 2589-2595. doi:10.1093/bioinformatics/btm396

[24] A. Nikolaev and S. Jacobson, “Simulated Annealing,” In: M. Gendreau and J.-Y. Potvin, Eds., Handbook of Metaheuristics, Springer, New York, 2010, pp. 1-39.

[25] G. Biau, “Analysis of a Random Forests Model,” LSTA, LPMA, Paris, 2010.

[26] N. Chawla, “Data Mining for Imbalanced Datasets: An Overview,” In: O. Maimon and L. Rokach, Eds., Data Mining and Knowledge Discovery Handbook, Springer, New York, 2010, pp. 875-886.

[27] C. Strobl, A. Boulesteix, T. Kneib, T. Augustin and A. Zeileis, “Conditional Variable Importance for Random Forests,” BMC Bioinformatics, Vol. 9, 2008, p. 307. doi:10.1186/1471-2105-8-25

[28] A. Hirashiki, Y. Yamada, Y. Murase, Y. Suzuki, H. Kataoka, Y. Morimoto, T. Tajika, T. Murohara and M. Yokota, “Association of Gene Polymorphisms with Coronary Artery Disease in Low- or High-Risk Subjects Defined by Conventional Risk Factors,” Journal of the American College of Cardiology, Vol. 42, No. 8, 2003, pp. 1429-1437. doi:10.1016/S0735-1097(03)01062-3

[29] A. Balatskiy, E. Andreenko and L. Samokhodskaya, “The Connexin37 Polymorphism as a New Risk Factor of MI Development,” Siberian Medical Journal, Vol. 25, No. 2, 2010, pp. 64-65 (in Russian).

[30] C. Coffey, P. Hebert, M. Ritchie, H. Krumholz, J. Gaziano, P. Ridker, N. Brown, D. Vaughan and J. Moore, “An Application of Conditional Logistic Regression and Multifactor Dimensionality Reduction for Detecting Gene- Gene Interactions on Risk of Myocardial Infarction: The Importance of Model Validation,” BMC Bioinformatics, Vol. 5, 2004, p. 49. doi:10.1186/1471-2105-5-49