A Preliminary Outline of the Statistical Inference Process in Genetic Association Studies ()
1. Introduction
A genome-wide association study (GWAS) is an inclusive genetic analysis to identify associations between specific genetic variations in the form of single-nucleotide polymorphism (SNPs) and phenotypic traits. These studies are very effective in genetic epidemiology as they provide a relatively superficial approach for detecting potential genetic contributors to common and complex diseases using a simple case-control study model [1] - [6].
Correct performance of such genetic association studies requires interdisciplinary knowledge. Specifically, knowledge of genetics, statistics, and bioinformatics are the primary key [6]. In this pathway, in-depth knowledge of the genetic architecture of the human genome was provided by two important research initiatives, the International HapMapProject and the 1000 Genomes project. The International HapMapProject [7] described the patterns of common SNPs within the human DNA (deoxyribonucleic acid) sequence whereas the 1000 Genomes (1KG) project [8] provided a map of both common and rare SNPs [6].
In such studies, very large sample sizes are required to identify and validate findings. The careful attention to data quality has been appreciated as even small sources of systematic or random error can cause spurious results. Hence, a number of strategies for quality control have been developed [6] [9] [10].
Along with these quality control and quality assurance of genotypic data, appropriate statistical association testing will need to be carefully conducted using sophisticated and dedicated genetics software [6].
The method of presenting the GWAS finding regarding the reporting of disease-associated or risk markers are quite different from most clinical or epidemiological studies. Particularly, p-values of a single SNP test along with its associated odds ratio are emphasized in case of results presentation [5] [11].
Here, the likelihood of the odd-ratios between two different alleles being statistically different than one is reported by the p-values in a GWAS, where the typical threshold of significance level for the most published GWAS is p = 5 × 10−08 [5].
This study focuses on the statistical analysis after genotyping calls are made and quality control and assurance measures are taken [11]. The main objective of this paper is to provide an overview of some introductory statistical analysis of GWAS using the SNP genotype DNA sequencing data.
2. GWAS Data Preparation
For the independence tests, the SNP genotype data for each gene were generated for 3000 individuals using computer simulation in R-programming language by assigning the equal probability for both the cases (0.5) and the controls (0.5) (Data 1). Another data containing the GWAS results was also generated via computer simulation in R, which is a replica of “PLINK assoc” output (https://zzz.bwh.harvard.edu/plink/anal.shtml) containing the following information (Data 2).
3. Demonstration of the Statistical Analysis
3.1. Testing Association
GWAS generally tests association for a single SNP which is a contingency table test of genotype counts and disease phenotype. For example, for a SNP with major allele “A” and minor allele “a”, the genotype counts (A/A, A/a and a/a) can be presented in a 2 × 3 contingency table along with a binary disease status or phenotype (case-control) (Table 1) [12].
It is expected that the relative allele or genotype frequencies to be the same in case and control groups under the null hypothesis of no association. Usually, the association for a given contingency table is tested by the simple chi-squared (χ2) [13] test with two degrees of freedom (d.f.), and the p-value is recorded. Each of these p-values is compared with the GWAS typical threshold of significance level, which is p = 5 × 10−08 for the most published GWAS [5].
For the practical application, this contingency table test (χ2-test) was firstly performed for evaluating each of the three randomly selected genes from the genotype data as described in Section 2 (Data 1). The conventional χ2-test was applied for each of SNPs contained in a gene, where the GENE1, GENE2 and GENE3 have 3, 5 and 10 SNPs, respectively. For example, the GENE1 contains 3 SNPs, the three individual χ2-tests were performed for each of the 3 SNPs, and the p-values were recorded. The tests for the other two genes were also performed in the same manner. The p-values for these three genes are presented in Table 2.
Comparing the p-values to the GWAS threshold p = 5 × 10−08, there is one SNP from the GENE1 is associated with the disease phenotype having p = 5.243314 × 10−10 (Table 2). Hence, GENE1 is associated with the disease phenotype.
Table 1. A 2 × 3 table of genotype counts for a single SNP with disease status.
Table 2. The p-values from the simple χ2-test for the 3 genes.
3.2. Models for the Association Tests
The conventional χ2-test does not include the sense of the genotype ordering (trend). Here, each of the genotypes is assumed to have an independent association with disease phenotype. But, these ordering could be included in the association tests of contingency tables by considering the disease penetrance. The penetrance function is an approach for modeling the relation between SNPs and risk of a given disease with the consideration of genotype ordering [14] [15] [16].
For a single diallelic SNP with alleles “A” (major, disease responsible) and “a” (minor, normal), the unordered genotype counts are presented in Table 1. For a disease status the risk factor is defined by the genotype or allele at a specific marker. Thus, the disease penetrance associated with a given genotype is the risk of disease of the individuals carrying that genotype or allele. This risk of carrying disease responsible genotype could be measured by defining the probabilistic functions, which intern define the conditional probabilities of being affected with a given disease conditional on carrying a specific genotype [12] [17] [18].
For the genotype counts as shown in Table 1, the three models can be defined in terms of a genetic penetrance parameter γ (γ > 1). An additive model implies that risk of developing disease is increased γ-fold for the genotype A/a and by 2γ-fold for the genotype A/A. A recessive model indicates that two copies of allele “A” are required for an γ-fold increase in disease risk, and a dominant model specifies that either one or two copies of allele “A” are required for an γ-fold increase in disease risk. An intuitive measure of the strength of an association is the relative risk (RR). In this genetic association analysis, each genetic model can be represented with the relation to this genotypic relative risks (GRR) under the assumption of phenocopies, where the GRR presents the increased risk of an individual having a disease responsible genotype over a person without it [9] [19].
Generally, the models should be chosen based on the mode of inheritance (dominant, additive and recessive). But, a common problem is the lack of knowledge concerning the mode of inheritance. Assumption of the incorrect mode of inheritance may lead to significant loss of power. Also, testing for all the possible models may increase type I error rate. Some studies have proposed ways to determine the robust procedures which will correctly specify the underlying model of inheritance. Also, these methods perform the analysis by maximizing the power and preserving the nominal type I error rate. The proposed methods are based on the theory of efficiency robust procedures, the deviations from Hardy-Weinberg equilibrium (HWE) and a combination of test statistics for the selection of the underlying genetic model [20] [21].
Now, the practical application of these concepts of penetrance in an association study using the contingency table can be demonstrated in one of the two ways. In one approach, these models of penetrance could be included in the contingency table analysis by rearranging the genotype counts according to the mode of inheritance (additive, dominant and recessive) [9] [19] [22]. The other way is to define the penetrance models that will specify the trends of risk with increasing numbers of disease responsible allele. Hence, the association could be tested by the Cochran-Armitage trend test for the additive, dominant and recessive models [9] [23] [24]. Tests with an additive model are common in GWAS when the underlying genetic model is unknown. This is because this model has reasonable power to detect both additive and dominant effects [9] [19].
For the genotype data (Data 1), the association test was further performed by the Cochran-Armitage trend test for additive model. Single SNP test was applied for each of the three genes (GENE1, GENE2 and GENE3), and the p-values were recorded (Table 3).
From the results of the Cochran-Armitage trend test, it was observed that the two SNPs from the GENE1 having the p = 6.284831 × 10−11 and 8.586496 × 10−08, respectively, are significant according to the GWAS threshold p = 5 × 10−08, and hence associated with the disease phenotype (Table 3). So, the GENE1 is to be reported as positive among the three.
On the other hand, the p-values from the two tests (the conventional χ2-test and the Cochran-Armitage trend test) are completely different from each other (Table 2 and Table 3). That is, considering the order of the genotypes is producing the different outputs as compared to the unordered case. More specifically, overall the trend test is producing smaller p-values except for some cases. Though, the GENE1 is resulting in significant association in both of the two tests. But, the conventional χ2-test is showing only one significant SNP whereas the Cochran-Armitage trend test is resulting in two significant SNPs. The SNP2 is significant for both the tests but the Cochran-Armitage trend test is producing smaller p-value (Table 2 and Table 3).
3.3. Multiple Testing Corrections
The GWAS evaluates several thousand of genes simultaneously under different conditions over the genome, where each gene consists of a different number of SNPs. Hence, such association studies consider huge number of simultaneous testing of the null hypothesis, which constitutes multiple testing.
In order to control the type I error rate accurately, an adjustment is required for the p-values obtained from such simultaneous testing process, because detection of false positives may occur in such microarray data analysis. Here, the false positives are genes that are found to be statistically different between conditions, but are not in reality.
Different types of multiple testing corrections include Bonferroni, Holm, Benjamini and Hochberg False Discovery Rate, etc. [9] [25] [26]. All of these methods have some underlying principles to be applied in practice.
As a practical application, the Bonferroni correction is applied for the marginal p-values obtained from both the tests (the conventional χ2-test and the Cochran-Armitage trend test) for the significant gene (GENE1). In this approach, the p-values are multiplied by the number of comparisons. The corrected p-values are presented in Table 4.
Table 3. The p-values from the Cochran-Armitage trend test for the 3 genes.
Table 4. The Bonferroni corrected p-values for both the conventional χ2-test and the Cochran-Armitage trend test for the significant gene, GENE1.
Overall, the p-values are changed after the correction. Specifically, this correction method provides quite larger p-values as compared to the marginal values. For the same gene (GENE1), the number of the significant SNPs remains the same for the conventional χ2-test. But, the scenario is changed for the Cochran-Armitage trend test. For this case, there were two significant SNPs before the correction whereas only one SNP is showing association after the correction is performed (Table 4).
4. Manhattan Plot
A commonly used plot in most GWAS to display the significant SNPs in terms of p-values summarizes the results of the millions of tests, which have been performed. It is also a presentation of the p-values of the entire GWAS on a genomic scale. The horizontal x-axis is a map of the genome (genomic coordinates) organized from left to right by chromosome, and within chromosome by location. The different colors of each block usually show the extent of each chromosome. Each dot on the Manhattan plot signifies a SNP. The vertical location of the SNP along the y-axis is its p-value for the correlation of itself with the phenotype. The p-values are negative-log-transformed (that is the −log10(p) label on the y-axis) so that the smaller values are higher on the plot [27].
Figure 1 is presenting the Manhattan plot of all the p-values from the PLINK output (Data 2). The red line indicates the threshold for genome-wide significance (p = 5 × 10−08), and the blue line for suggestive associations (p = 1 × 10−05).
Figure 1. Manhattan plot of all the p-values for the GWAS results (Data 2).
There is one SNP that crossed the red line (the solid grey dot circled by a red border, Figure 1). That is, this SNP is significant according to the GWAS threshold. Based on the information of GWAS results (Data 2), this significant SNP is on chromosome 6 with the p-value 4 × 10−09. As the y-axis presents the −log10(p), which is equivalent to the number of zeros after the decimal point plus one. For example, see the presentation of the p-value = 4 × 10−09 for the significant SNP indicated in a red circle in Figure 1. On the other hand, two SNPs on chromosomes 3 and 4 (the black and the grey dots between the blue and the red lines) have crossed the suggestive significance level (the blue line) having the p-values 1 × 10−07 and 2 × 10−06, respectively.
5. Dedicated Software for GWAS
The open-source statistical software like R [28] can be used for performing and visualizing all of these analyses as mentioned in this paper. But, there are some customized and dedicated GWAS software. For example, PLINK [29] is the most popular and computationally efficient software program that offers an inclusive and properly documented set of an automated GWAS analysis including the quality control, association testing, etc. The open source software PLINK is written in C++ and can be installed on Windows, Mac and UNIX machines.
6. Conclusion
A practical guideline for the GWAS association testing is provided in this paper. Although this application considers simulated data of SNP genotype and the GWAS results, the real data sets can also be handled in similar ways as outlined here. All of these theoretical contexts of statistical association testing along with the practical application would be made GWAS more accessible to statistical researchers without having any formal training in this field.