Genomic selection is a type of marker-assisted selection that involves the estimation of genomic breeding values (GEBV) based on a large number of markers across the genome. Genomic selection relies on the assumption that all relevant quantitative trait loci (QTL) are in linkage disequilibrium (LD) with genotyped SNP markers. Thus, linkage disequilibrium, or the non-random association of alleles at different loci, across genotyped markers and between the latter and QTLs fundamentally conditions the efficiency of the association analysis, and it is of great importance in QTL mapping, genomic selection and genome-wide association studies. Although the strength of LD between genotyped SNP markers is easy to calculate, inferring the level of LD between SNP markers and QTLs is a complex problem because QTL genotypes are unavailable in the majority of genomic association studies. Although knowledge of the QTL genotypes or their LD with SNP markers in the panel is not needed in association studies, such information could be of great interest in some applications such as multi-breed and crossbred genomic selection.
Genomic selection has been successful in predicting genomic breeding values. However, this success has not extended to admixed breeds or crossbreds. Several studies showed that the structure of the reference population strongly impacts the accuracy of genomic predictions. Moreover, SNP marker estimates derived from one breed have little to no power to predict the GEBVs of animals in a different breed. A potential solution would be to use a pooled multi-breed reference population to predict the GEBVs of animals in other breeds or of crossbred animals. This method showed promising results in improving prediction accuracy when a breed has a limited number of records. However, the performance of this approach, as expected, depends largely on the genetic similarity between the components of the admixed population.
Although simple in concept, the multi-breed reference population approach makes strong assumptions about genetics and population structure. In its most basic formulation, it assumes a genetically homogeneous population where SNP marker effects are constant across sub-populations or breeds. Further, it assumes that linkage disequilibrium (LD) between SNPs and QTLs is the same across the reference and validation populations. Although this holds for within-breed genomic selection, the assumption is often violated when breeds with different genetic structures and backgrounds are considered. This genetic difference between breeds is manifested by varying allele frequencies for markers and QTLs, and by changes in LD strength, LD structure and linkage phase. Furthermore, several studies have evaluated LD blocks in various population structures and reported differences in the extent of LD. For example, Shifman et al. (2003) showed that LD was several-fold higher in isolated populations than in outbred populations, very likely due to higher inbreeding. Similarly, Lindblad-Toh et al. (2005) reported, as expected, larger LD blocks within breeds than across breeds. Hay and Rekaya (2015) showed that accommodating the potential change in SNP effects between the different components of an admixed population increased the accuracy of genomic prediction. When the change in SNP effects was modeled directly, a substantial increase in accuracy was observed compared to the classical pooled data approach. Unfortunately, such a model suffers from high dimensionality and numerical instability, especially in the presence of a large number of SNPs. Their indirect approach to account for the change in SNP effects was based on a heuristically developed structural model using available information on marker genotypes. Although it remedies the problems associated with the direct approach and yields better results than the classical pooled data model, its performance is significantly lower than that of the direct approach.
These results indicate that the change in the distribution of SNP marker genotypes between sub-populations is likely to carry relevant information about the change of LD structure and strength between markers and QTLs across components of the admixed population, and that this information could be harnessed to account for the change in SNP effects. Since genomic selection largely depends on LD structure, it is of great importance to be able to evaluate and infer the magnitude of the change in LD between SNP markers and QTLs in different populations. This information might shed some light on the change of SNP effects across different breeds or lines and on how to adjust for this change. The objective of this study is to evaluate and infer the change of LD between markers and QTLs across two breeds using simulated data sets.
2. Materials and Methods
As indicated in the introduction, genetic heterogeneity between sub-populations leads to changes in the estimates of SNP effects due to changes in LD between observed markers and putative QTLs. The foundation of genome-wide association is that QTL effects can be inferred indirectly through their correlation with genotyped markers. Across sub-populations, the LD structure between markers, as well as between markers and QTLs, changes. Consequently, it is reasonable to postulate that the change in LD between SNP markers across two sub-populations could explain, at least partially, the change in LD between markers and QTLs.
In order to evaluate this hypothesis, several small-scale simulations were carried out. In these simulations, the genotypes of the QTL(s) and associated SNP markers were all assumed known. Thus, LD between SNP markers and QTL(s) was available. In all cases, our goal was to test the ability of the change in marker-marker LD to predict the change in marker-QTL LD.
Simulation scenarios: Three simulation scenarios with varying numbers of SNP markers and QTLs were carried out to test the postulated hypothesis. In all cases, two divergent sub-populations for a trait with heritability equal to 0.5 were generated. A full description of the simulation parameters is presented in the data simulation paragraph below. Two regression models (M1, M2) were evaluated for their ability to predict the change in marker-QTL LD.
In both models, the response variable is the difference in LD between marker k and the QTL across the two sub-populations, and the regression coefficients are unknown. In M1, the explanatory variables are the mean and standard deviation of the difference in LD between marker k and either all the remaining SNPs or 100 adjacent SNP markers. In M2, the explanatory variables are defined in the same way, except that they represent the relative mean and relative standard deviation of the difference in LD. To evaluate the fit of each model, the coefficient of determination (R²) was calculated.
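The two models described above can be sketched as linear regressions. The notation below is an assumption introduced for illustration only: the symbols, the intercept $\beta_0$ and the residual $e_k$ are not taken from the source. Here $\Delta_k$ denotes the difference in LD between marker $k$ and the QTL across the two sub-populations, $\mu_k$ and $\sigma_k$ the mean and standard deviation of the difference in LD between marker $k$ and the other SNPs considered, and $\mu_k^{*}$ and $\sigma_k^{*}$ their relative counterparts.

```latex
\begin{align}
\text{M1:}\quad \Delta_k &= \beta_0 + \beta_1\,\mu_k + \beta_2\,\sigma_k + e_k \\
\text{M2:}\quad \Delta_k &= \beta_0 + \beta_1\,\mu_k^{*} + \beta_2\,\sigma_k^{*} + e_k
\end{align}
```

Under this reading, the two models differ only in how the explanatory variables are scaled, so their R² values can be compared directly on the same response.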
Linkage disequilibrium across SNP markers and between SNP markers and QTLs in both lines was calculated using the $r^2$ coefficient, with the following general equation:

$$r^2 = \frac{D^2}{p_A \, p_a \, p_B \, p_b}$$

where $D$ is calculated as $D = p_{AB} - p_A p_B$, and $p_{AB}$, $p_A$, $p_a$, $p_B$ and $p_b$ are the observed frequencies of haplotype AB and of alleles A, a, B and b, respectively. The higher the $r^2$, the stronger the linkage disequilibrium.
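As a minimal sketch of the LD calculation described above, the following Python functions compute $r^2$ from haplotype and allele frequencies. The function names and the 0/1 haplotype coding are assumptions made for this illustration and do not come from the source.

```python
def ld_r2(p_ab, p_a, p_b):
    """Return r^2 between two biallelic loci.

    p_ab: observed frequency of haplotype AB
    p_a:  frequency of allele A at the first locus
    p_b:  frequency of allele B at the second locus
    """
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    # Product of the four allele frequencies p_A * p_a * p_B * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


def r2_from_haplotypes(haplotypes):
    """Estimate r^2 from phased haplotypes.

    haplotypes: list of (allele1, allele2) pairs, ASSUMED coded 0/1,
    where 1 stands for allele A at locus 1 and allele B at locus 2.
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n
    p_b = sum(h[1] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    return ld_r2(p_ab, p_a, p_b)
```

For example, a sample containing only AB and ab haplotypes in equal proportions gives $r^2 = 1$, whereas a sample in which all four haplotypes are equally frequent gives $r^2 = 0$.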
For all cases and for both models, the unknown coefficients were estimated using PROC GLM in SAS software.
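For readers without access to SAS, the estimation step can be approximated with ordinary least squares. The sketch below is not the authors' code: it assumes a linear model with an intercept, uses hypothetical function names, and relies only on the Python standard library. It returns the estimated coefficients and the coefficient of determination.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: covariates are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(y, covariates):
    """Fit y = b0 + b1*x1 + b2*x2 + ... by least squares.

    y:          list of responses (e.g. change in marker-QTL LD per marker)
    covariates: list of tuples, one per observation (e.g. (mean, sd))
    Returns (coefficients, R^2). The intercept b0 is an ASSUMPTION.
    """
    rows = [[1.0] + list(x) for x in covariates]  # design matrix rows
    p = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    if ss_tot == 0.0:
        raise ValueError("R^2 is undefined when the response is constant")
    return beta, 1.0 - ss_res / ss_tot
```

Solving the normal equations directly is adequate for two covariates; for larger or ill-conditioned designs, a QR-based solver from a numerical library would be a safer choice.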
Data simulation: QMSim software was used for data simulation. A historical population of unrelated individuals was simulated and used as a base population for two pure breeds (A and B). Breeds A and B consisted of 1677 and 1668 individuals, respectively. The simulated genome consisted of 1 chromosome, with varying numbers of QTLs and SNP markers; markers were spaced, on average, 50 kb apart. Minor allele frequency was set to 0.05. QTL additive effects were sampled from a gamma distribution with shape and scale parameter equal to 0.4. Phenotypes were simulated based on a heritability of 0.5. Three simulation scenarios were carried out. In the first scenario, 10 SNP markers and 1 QTL were considered. The QTL was positioned in close proximity to SNP marker 5. In the second scenario, the number of SNP markers was increased to 300 and the number of QTLs to 3. Finally, in the last simulation scenario, the number of SNP markers was increased to 3000 and the number of QTLs to 30. These QTLs were randomly positioned across the genome. All SNP markers were used in the inference of the change in marker-QTL LD. In both statistical models (M1, M2), the explanatory variables were computed from the LD between marker k and either all the remaining SNPs or a window of 100 adjacent SNP markers.
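The explanatory variables for marker k can be illustrated with a short Python sketch. Several details are assumptions made here and are not stated in the source: the window is centred on marker k and shifted inward at the chromosome ends so that it always contains 100 markers, the sample (n - 1) standard deviation is used, and the function names are hypothetical.

```python
from statistics import mean, stdev


def window_indices(k, n_markers, window=None):
    """Indices of the markers paired with marker k.

    window=None pairs k with all remaining markers. Otherwise a block of
    `window` adjacent markers around k is used (ASSUMPTION: centred on k
    and shifted inward at the ends so its size stays constant).
    """
    if window is None or window >= n_markers - 1:
        return [j for j in range(n_markers) if j != k]
    start = min(max(0, k - window // 2), n_markers - window - 1)
    return [j for j in range(start, start + window + 1) if j != k]


def ld_change_covariates(r2_a, r2_b, k, window=None):
    """Mean and SD of the between-breed difference in marker-marker LD.

    r2_a, r2_b: square nested lists of pairwise r^2 among markers in
                breeds A and B (same marker order in both).
    Returns (mean, sd) of r2_a[k][j] - r2_b[k][j] over the selected j.
    """
    idx = window_indices(k, len(r2_a), window)
    diffs = [r2_a[k][j] - r2_b[k][j] for j in idx]
    return mean(diffs), stdev(diffs)  # sample SD (n - 1 denominator)
```

Calling `ld_change_covariates(r2_a, r2_b, k)` uses all remaining markers, while `ld_change_covariates(r2_a, r2_b, k, window=100)` restricts the calculation to 100 adjacent markers.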
3. Results and Discussion
Linkage disequilibrium between the SNP markers and the QTL for lines A and B, as well as their difference, in the first simulation scenario is presented in Table 1. Since the QTL was placed in the center of the simulated segment, the LD was, as expected, higher for markers 4, 5 and 6. Figure 1 shows the trend of LD between the SNP markers and the QTL for the two lines. Similarly, the LD between markers for the two lines, as well as the difference in LD, was calculated. To infer the change in marker-QTL LD between the two breeds, the mean and standard deviation of the difference in marker-marker LD were calculated and later used as explanatory variables in the regression model (Table 2). Fitting model M1 resulted in an R² of 0.65, indicating that the mean and standard deviation of the change in marker-marker LD explained around two thirds of the variation in the change in marker-QTL LD between breeds A and B. On the other hand, fitting model M2 resulted in a 20% decrease in R² (0.52). Although M2 resulted in a lower R², the model was still able to explain a substantial portion of the variation in the change in marker-QTL LD across the two breeds.
Figure 1. Linkage disequilibrium between markers and QTL for breeds A and B.
Table 1. Linkage disequilibrium between markers and QTL for breeds A and B in the first simulation scenario.
¹LD between marker and QTL for breed A; ²LD between marker and QTL for breed B; ³Difference in marker-QTL LD between breeds A and B.
Table 2. Mean and standard deviation of the change of LD between markers in the first simulation scenario¹.
¹Difference in marker-marker LD between breeds A and B.
When the numbers of SNP markers and QTLs were increased to 300 and 3, respectively (second simulation scenario), the coefficients of determination tended to decrease, whether all the SNP markers (300) or fixed-size windows of 100 SNPs were used to calculate the explanatory variables of the regression model. Table 3 shows the resulting coefficients of determination (R²) for models M1 and M2 using all markers and using fixed windows of 100 SNPs. Using M1 with all 300 markers resulted in R² equal to 0.14, 0.12 and 0.12 for QTLs 1, 2 and 3, respectively. With 100-marker windows, R² increased to 0.26 for QTL 1, 0.24 for QTL 2, and 0.27 for QTL 3. This increase in R² is due to at least two reasons: 1) a QTL was positioned in each 100-SNP window, and 2) including all 300 SNP markers, a large portion of which have no LD with the QTL, made the mean and standard deviation of the change in marker-marker LD less informative about the variation in the change in marker-QTL LD. The largest increase in R² was for QTL 3, from 0.12 to 0.27. Using M2, a substantial decrease in R² was observed across all QTLs using either 100-marker windows or all markers. Table 4 shows the average R² across all 3 QTLs; it is clear that M1 performed better than M2 in this simulation scenario.
Table 3. Coefficient of determination for models M1 and M2 in the second simulation scenario.
In the third simulation scenario, a larger SNP panel (3000 SNPs) and a higher number of QTLs (30) were simulated. Table 4 shows the average R² obtained using M1 and M2. Clearly, M1 performed notably better than M2 using either all markers or 100-marker windows. For example, fitting M1 using all markers resulted in an average R² of 0.27 compared to 0.01 for M2. It should be mentioned that M2 did not explain any of the variation in the change of marker-QTL LD across breeds A and B.
Table 4. Average coefficient of determination over all QTLs for models M1 and M2 in the second and third simulation scenarios.
Across the three simulation scenarios, it is clear that a substantial portion of the variation in the change in marker-QTL LD could be explained by information already available in the observed SNP marker data. Furthermore, the statistical model, as well as the extent of the window of SNPs considered in calculating the explanatory variables of the regression, plays a crucial role in estimating the change in LD between markers and QTLs across the two breeds. Based on the results of this simulation study and the structure of LD generated, it seems that small windows are preferable. This is because including a large number of SNPs with little to no LD with the QTL(s) renders the mean and standard deviation non-informative about the variation in the change in marker-QTL LD. With real data, the situation will be more complex due to a larger number of SNP markers and QTLs, where the latter have a random and unknown distribution. In such a case, information about LD blocks should be used to determine the length of the SNP windows. Additionally, the relationship between the change in marker-QTL LD and the observed information in the SNP genotypes could be non-linear, in which case it would not be approximated well by simple regression models.
In this simulation study, inferring the change of linkage disequilibrium between markers and QTLs across two pure breeds proved to be possible. This might help in inferring the change of SNP marker effects when different breeds or lines are present in the population. Further testing and research are required to determine whether this could be used in genomic selection for admixed populations.
Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA.
USDA is an equal opportunity provider and employer.