Revealing GE Interactions from Trial Data without Replications

1. Introduction

Replication is often required for valid data processing and statistical tests [1]. In addition, it makes it possible to dissect potential interaction effects among factors of interest, such as genotype-by-environment (GE) interaction [2]. Multi-environment crop trial data, such as multi-location and multi-year trial data sets, are commonly reported by various institutions and made publicly available. However, most published crop trial data include only entry means under each environment rather than the original replicated field plot data. Without replications, it is statistically challenging to detect GE interaction effects, which are closely related to yield stability, from crop trial reports. Therefore, a method that can effectively detect GE interactions when the original replicated field trial data are not available would be a valuable addition.

Because the entire original data can be treated as missing data, recovering the potential genetic information requires an imputation method that generates “alternative replicated field data” from which the information in the original data can be recovered. There are two major categories of imputation: single imputation (SI) and multiple imputation (MI). With SI, missing values are filled in with some type of predicted value, obtained by methods such as mean imputation, regression imputation, and/or matching [3] [4] [5]. Although SI has been widely used, one shortcoming is that it does not reflect the full uncertainty created by missing data and almost always underestimates the variance. For example, the regression imputation method uses an estimated regression model to predict or impute the missing values. This could cause relationships to be over-identified and suggest greater precision in the imputed values than is warranted. To account for the uncertainty introduced by imputation, MI, which repeats the imputation several times and thus produces multiple imputed data sets, is recommended, especially when data are missing at random (MAR) [4]. MI follows three basic steps: imputation, analysis, and pooling [6] [7]. With MI, bias can be reduced and estimates are more precise. MI has several desirable features. First, introducing appropriate random error into the imputation process makes it possible to obtain approximately unbiased estimates of all parameters. Second, repeated imputation allows researchers to obtain better estimates of the standard errors. Third, MI can be used with any kind of data and analysis.

Unlike many other missing-data problems, here the entire set of original field measurements is unavailable and only the entry means under each environment are known. Therefore, an important step is to propose a probability density function for each entry/genotype based on the published results, which can then be used to impute the entire “original data” so that the genetic information harbored in the original data, including GE interactions, can be detected. In the present study, our objectives were 1) to propose a new procedure to generate a data set with repeated measurements from given entry means and 2) to numerically validate the new method with a data set containing six locations, 28 potato genotypes, and three replications in each location [8]. The purpose of this study is to provide an alternative method and computer tool to improve data analysis and statistical tests and thus to reveal more of the information harbored in historical crop trial data when replications are not available.

2. Materials and Methods

2.1. Linear Model for GE Analysis

The linear model for an observation ${y}_{hij}$, measured on genotype i in block j nested within environment h, can be expressed as follows:

${y}_{hij}=\mu +{E}_{h}+{G}_{i}+G{E}_{hi}+{B}_{j\left(h\right)}+{e}_{hij}$ (1)

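where $\mu$ is the population mean, ${E}_{h}$ is the environmental effect, ${G}_{i}$ is the genotypic effect, $G{E}_{hi}$ is the GE interaction effect, ${B}_{j\left(h\right)}$ is the block effect within environment h, and ${e}_{hij}$ is the random error. To detect GE interaction effects, replication within each environment is required. Without replication, the GE interaction effects and the random error are confounded and cannot be separated, so the GE interaction and block effects have to be omitted from model (1).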
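
As an illustrative restatement of this confounding (the combined residual symbol ${\varepsilon}_{hi}$ is introduced here only for illustration and is not part of the original notation), with a single value per genotype-environment cell and no block term, model (1) reduces to:

${y}_{hi}=\mu +{E}_{h}+{G}_{i}+{\varepsilon}_{hi},\quad {\varepsilon}_{hi}=G{E}_{hi}+{e}_{hi}$

Only the variance of the sum ${\varepsilon}_{hi}$ can be estimated from such data, which is why the interaction and error components cannot be told apart.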

2.2. Model Used for Data Imputation

The linear model for an observation under a single environment can be described as model (2) without including environmental and GE interaction effects:

${y}_{ij}=\mu +{G}_{i}+{B}_{j}+{e}_{ij}$ (2)

In model (2), ${G}_{i}$ may include the GE interaction effect, if one exists. If we assume that block effects and random errors follow two independent normal distributions, then ${y}_{ij}$ follows the normal distribution in (3):

${y}_{ij}\sim N\left(\mu +{G}_{i},{\sigma}_{B}^{2}+{\sigma}^{2}\right)$ (3)

Given the distribution in (3), if we know $\mu +{G}_{i}$ and ${\sigma}_{B}^{2}+{\sigma}^{2}$, we can generate ${y}_{ij}$ under each single environment. Blocking is used for local control of field variation within each environment; however, block effects may not affect the estimated variance components for genotypic effects and random error, or the prediction of genotypic effects, when model (2) is applied. Therefore, to simplify, we can assume that there are no block effects and omit them during the data imputation process. In that case, the distribution in (3) simplifies to the normal distribution in (4):

${y}_{ij}\sim N\left(\mu +{G}_{i},{\sigma}^{2}\right)$ (4)

The actual values of $\mu$, ${G}_{i}$, and ${\sigma}^{2}$ are unknown. If we substitute $\mu +{G}_{i}$ and ${\sigma}^{2}$ with the estimates $\hat{\mu}+{\hat{G}}_{i}$ and ${\hat{\sigma}}^{2}$, then we can impute each ${y}_{ij}$ accordingly:

${\hat{y}}_{ij}\sim N\left(\hat{\mu}+{\hat{G}}_{i},{\hat{\sigma}}^{2}\right)$ (5)

where $\hat{\mu}$ is an estimated population mean, ${\hat{G}}_{i}$ is an estimated/predicted genotypic effect for genotype i, and ${\hat{\sigma}}^{2}$ is an estimated variance for random error. In many trial reports, individual genotypic means under each environment are available and can thus be used to substitute for $\mu +{G}_{i}$, and the mean square error (MSE) can be used to substitute for ${\sigma}^{2}$. The MSE value for each environment can be derived from the coefficient of variation (CV) or the least significant difference (LSD).
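
The derivation of MSE from a reported CV or LSD can be sketched as follows. This is a minimal illustration, not the author's code. It assumes the usual definitions CV(%) = 100 × sqrt(MSE) / trial mean and LSD = t × sqrt(2 × MSE / r) for r replications. The function names, the report values, and the t value are all hypothetical.

```python
import math


def mse_from_cv(cv_percent, trial_mean):
    """Return the MSE implied by CV(%) = 100 * sqrt(MSE) / trial_mean."""
    return (cv_percent / 100.0 * trial_mean) ** 2


def mse_from_lsd(lsd, t_critical, n_reps):
    """Return the MSE implied by LSD = t_critical * sqrt(2 * MSE / n_reps)."""
    return n_reps * (lsd / t_critical) ** 2 / 2.0


# Hypothetical report values, for illustration only.
mse_a = mse_from_cv(cv_percent=20.0, trial_mean=30.0)

# Round trip: build an LSD from a known MSE, then recover that MSE.
t_critical = 2.005  # approximate two-sided 5% t value for 54 error df (assumed)
lsd = t_critical * math.sqrt(2.0 * mse_a / 3)
mse_b = mse_from_lsd(lsd, t_critical, n_reps=3)
```

The t value has to match the error degrees of freedom and significance level used in the report, so it is passed in explicitly rather than computed.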

2.3. Data Source

The data set (plrv) used to demonstrate our imputation analysis is available in the R package agricolae [8]. The data set contains six environments, 28 potato genotypes, and three replications in each environment. It includes three agronomic traits, of which only yield was used for this study. The major reason for using this data set as a demonstration is that it is publicly available [8], so interested parties can reproduce the results with the code developed by the author of this study.

2.4. Data Imputation and Analysis

Data imputation: Phenotypic means for yield were calculated for the 28 potato genotypes at each of the six locations. The unit of yield is not provided in the package; interested readers may contact the package developer for more detailed information. The MSE for each environment was calculated from the data by ANOVA subject to model (2). Both the phenotypic means and the MSE under each environment were used to generate imputed data. Assuming that the data were normally distributed, individual observations with no block effects were imputed for each environment from the normal distribution in Equation (5) using the R function rnorm [9]. The process was repeated for each of the six environments. The number of replications used for data imputation was the same as in each original experiment. All imputed data sets were combined into a multi-environment data set for the subsequent analysis. To assess the impact of the number of imputations, the imputation was repeated 10, 20, 50, 100, 200, and 500 times. Results from both individual imputed data sets and pooled imputed data sets are reported and compared.
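
The imputation step described above can be sketched as follows. This is a minimal illustration in Python rather than the author's R scripts, assuming that each observation is drawn independently from the normal distribution in Equation (5). The function names, genotype and environment labels, and numeric values are hypothetical.

```python
import random


def impute_environment(entry_means, mse, n_reps, rng):
    """Draw n_reps values per genotype from N(entry mean, MSE)."""
    sd = mse ** 0.5
    return {
        genotype: [rng.gauss(mean, sd) for _ in range(n_reps)]
        for genotype, mean in entry_means.items()
    }


def impute_trial(means_by_env, mse_by_env, n_reps, seed=None):
    """Impute one multi-environment data set, one environment at a time."""
    rng = random.Random(seed)
    return {
        env: impute_environment(means, mse_by_env[env], n_reps, rng)
        for env, means in means_by_env.items()
    }


# Hypothetical entry means and MSE values, for illustration only.
means_by_env = {
    "Env1": {"G1": 40.0, "G2": 12.0},
    "Env2": {"G1": 5.0, "G2": 11.0},
}
mse_by_env = {"Env1": 25.0, "Env2": 4.0}

# Ten independent imputed data sets, each with three replications per cell.
imputed_sets = [
    impute_trial(means_by_env, mse_by_env, n_reps=3, seed=k) for k in range(10)
]
```

Each element of `imputed_sets` plays the role of one imputed multi-environment data set that would then be analyzed subject to model (1).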

Data analysis: First, phenotypic means for the different genotypes in each environment were calculated for the original data set and for each imputed multi-environment data set. Second, linear mixed model (LMM) approaches such as restricted maximum likelihood (REML) and minimum norm quadratic unbiased estimation (MINQUE) [10] [11] can be used to analyze each imputed data set subject to model (1). In this study, variance components for genotypic effects and GE interaction effects were estimated by the MINQUE approach [12], and genotypic effects and GE interaction effects were predicted using the adjusted unbiased prediction (AUP) method [13]. The mean and 95% confidence interval (CI) of each parameter were calculated. All data analyses were conducted in the R environment [9]. The minque package [14], with the MINQUE approach for variance component estimation and the AUP approach [13] for random effect prediction, was used to analyze the imputed data. The R scripts for data imputation and the related data analyses were developed by the first author of this study and will be available upon request.
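
The pooling of one parameter over multiple imputed data sets can be sketched as follows. This is a minimal illustration that assumes the pooled estimate is the mean over imputations and the 95% interval is given by the 2.5% and 97.5% percentiles. The linear-interpolation percentile rule, the function names, and the example estimates are assumptions for illustration.

```python
import statistics


def percentile(sorted_values, p):
    """Linear-interpolation percentile (p in [0, 100]) of a sorted list."""
    k = (len(sorted_values) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)


def pool_estimates(estimates, level=95.0):
    """Pool one parameter's estimates from several imputed data sets."""
    values = sorted(estimates)
    tail = (100.0 - level) / 2.0
    return {
        "mean": statistics.fmean(values),
        "lower": percentile(values, tail),
        "upper": percentile(values, 100.0 - tail),
    }


# Hypothetical estimates of one effect from ten imputed data sets.
estimates = [2.1, 1.8, 2.4, 2.0, 1.9, 2.3, 2.2, 1.7, 2.0, 2.6]
pooled = pool_estimates(estimates)
```

With only a few imputed data sets the percentile limits are driven by the most extreme estimates, so they become more stable as the number of imputations grows.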

3. Results

3.1. Original and Imputed Phenotypic Means for 28 Potato Genotypes

The phenotypic means for the 28 potato genotypes under six environments, calculated from the original data set, are provided in Table 1. Generally, the ranges among the six environments were wider than the ranges among genotypes within each environment (Table 1), indicating that environmental effects played a more important role in yield than genotypes did. Some genotypes appeared to be better adapted to specific environments. For example, Canchan was well adapted to Hyo-02 (47.78) while Desiree was poorly adapted to the same environment (8.89). On the other hand, Desiree was better adapted to the environment SR-03 (11.42) than Canchan was (2.42), indicating that genotype-by-environment (GE) interactions also played an important role in potato yield. Therefore, it is of interest to investigate GE interaction effects in the yield trial analysis.

Phenotypic means and their 95% confidence intervals (ranges of 2.5% and 97.5% percentiles) for 28 entries under each environment over 50 imputed data sets are provided in Table 2. Comparing the results in Table 2 and Table 1, the imputed means and original means were close to each other with a maximum difference of 1.35 and a mean difference of 0.28. The correlation coefficient between the original phenotypic means and imputed phenotypic means was almost 1.0.

Table 1. Individual phenotypic yield means for 28 genotypes in each of six environments+.

+: The original data set is available in R package agricolae [8] ; however, the unit for yield is not provided.

Table 2. Phenotypic yield means for 28 genotypes (IMean) over 50 imputed data sets and their 95% confidence intervals (LL = lower limit and UL = upper limit) in each of six environments.

The results implied that the imputed phenotypic data represented the original data. In addition, the simulated 95% confidence intervals were closely related to the mean square error (MSE) for each of the six environments. The widest confidence intervals were observed in environment LM-03, which had the largest MSE (87.08), while the narrowest were observed in SR-03, which had a small MSE (Table 2).

3.2. Imputed Entry Means for 28 Potato Genotypes

Correlation coefficients between phenotypic means from the original data and five sets of imputed means were obtained and are presented in Figure 1. The correlation coefficients between original phenotypic means and five sets of imputed phenotypic means were around 0.98 while coefficients among five sets of imputed phenotypic means were around 0.96. The results showed that phenotypic means obtained from each individual imputed data set were also highly consistent, implying that the imputed phenotypic means well represented the original phenotypic mean data.
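
The kind of agreement check reported above can be sketched as follows. This is a minimal illustration that assumes the Pearson correlation coefficient is the measure used. The function name and the entry means are hypothetical and are not values from Table 1 or Table 2.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical entry means, for illustration only.
original_means = [47.8, 8.9, 30.1, 22.4]
imputed_means = [46.9, 10.2, 29.5, 23.8]
r = pearson(original_means, imputed_means)
```

In practice the two vectors would hold all genotype-by-environment entry means from the original data set and from one imputed data set, in the same order.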

3.3. Pooled Results

Because a single imputed data set carries some degree of uncertainty, multiple imputed data sets were used to reduce the bias that a single imputation could cause. The question is how many imputed data sets are sufficient to adjust the bias. As described above, we generated 10, 20, 50, 100, 200, and 500 imputed data sets, which were used to obtain the pooled phenotypic means for 28 genotypes under six environments; the mean variance components for environmental effects, genotypic effects, GE interaction effects, and random errors; and the mean predicted environmental effects, genotypic effects, and GE interaction effects. Because of the large volume of results, only summarized results are provided.

Figure 1. Correlations among individual phenotypic means from the original data set and five imputed data sets. OM = individual means from the original data set. I1 to I5 = individual means from the 1st five imputed data sets.

Figure 2 shows that phenotypic means from the original data set were highly correlated and consistent with pooled phenotypic means from multiple imputed data sets (correlation coefficients were close to 1). Figure 3 shows that predicted environmental effects from the original data were highly consistent with the pooled predicted environmental effects from different numbers of imputed data sets (correlation coefficients among these predicted environmental effects were close to 1). The same conclusions can be drawn for predicted genotypic effects (Figure 4) and predicted GE interaction effects (Figure 5). These results suggest that 10 repeated imputed data sets were sufficient to obtain unbiased phenotypic means and predicted environmental effects, genotypic effects, and GE interaction effects.

Figure 2. Correlations among individual phenotypic means from the original data set and individual phenotypic means from different multi-imputed data sets. OM = individual means from the original data set. IM10, IM20, IM50, IM100, IM200, and IM500 = pooled phenotypic means from 10, 20, 50, 100, 200, and 500 imputed data sets.

Figure 3. Correlations among predicted environmental effects from the original data set and mean environmental effects from different multi-imputed data sets. OE = environmental effects from the original data set. IE10, IE20, IE50, IE100, IE200, and IE500 = pooled environmental effects from 10, 20, 50, 100, 200, and 500 imputed data sets, respectively.

Figure 4. Correlations among predicted genotypic effects from the original data set and mean genotypic effects from different multi-imputed data sets. OG = genotypic effects from the original data set. IG10, IG20, IG50, IG100, IG200, and IG500 = pooled genotypic effects from 10, 20, 50, 100, 200, and 500 imputed data sets.

Figure 5. Correlations among predicted GE interaction effects from the original data set and mean GE interaction effects from different multi-imputed data sets. OG = GE interaction effects from the original data set. IG10, IG20, IG50, IG100, IG200, and IG500 = mean GE interaction effects from 10, 20, 50, 100, 200, and 500 imputed data sets.

In summary, the results from imputed data were highly consistent with those results from the original data set, which includes replication. The results used for our comparisons included phenotypic means, environmental effects, genotypic effects, and GE interaction effects. In addition, it appears that pooled results from 10 repeated imputed data sets were almost identical to the results from the original data set with replications.

4. Discussion

Crop trial data can provide important information to researchers. Revisiting historical data can help researchers reveal more genetic information in different respects. As mentioned above, however, many published trial data are summarized, and the usefulness of summarized data is limited by the lack of the original replicated field plot data. Therefore, it is crucial to generate new data sets that can be used to reveal genetic information comparable to the results from the original data with replications. This was our motivation for proposing a new methodology in this study.

The key component in data imputation is to determine appropriate probability models, which can be used to generate simulated data to substitute for multiple missing data points. Data imputation in this study can therefore be considered a simulation technique given particular probability models. Though the original field data from multi-environment crop trials were not available, results such as entry means, numbers of replications, and mean square error provide the information needed to determine a probability density function for each genotype/entry under each environment. With such a probability model for each genotype/entry, the entire data set with replications can be imputed. Once the data are imputed, various statistical analyses can be applied to them, such as the linear mixed model analysis used in this study.

Because of the uncertainty of a single imputed data set, multiple imputed data sets were used in this study to reduce the potential bias for each parameter. The question is how many independent imputed data sets are sufficient to represent the results from the original data set. For the demonstration data set, which is available in the R package agricolae, the correlation coefficient was around 0.98 between the phenotypic means from each of five individual imputed data sets and the phenotypic means from the original data, while the correlation coefficients among the phenotypic means from the five individual imputed data sets were around 0.96 (Figure 1), showing that each imputed data set could be used to substitute for the original data with replications. Our results also showed that 10 imputed data sets could sufficiently adjust the bias for this demonstration data set. However, it is likely that more imputed data sets would be required when the MSE is large. MSE values are sometimes not available in trial reports; one possible solution is to impute multiple data sets using a wide range of MSE values. This option is particularly relevant when individual MSEs in different environments are not available.

Though the method proposed in this study could help determine GE interaction from imputed data sets and facilitate statistical tests and result validation, the imputation method is based on the assumption that the original data are normally distributed, the same assumption on which ANOVA and mean comparisons are based. Many original field trials included replicated field blocks; however, this information is often unavailable in the report. Our study showed that results from the imputed data without block effects were highly consistent with the results from the original data including blocks. Therefore, for simplicity, block effects could be omitted when imputing trial data. In addition, data were imputed based on the MSE under each environment; however, it appears that imputation based on the individual MSE under each environment and imputation based on the MSE over environments yielded almost identical results, suggesting that either the individual MSE or the pooled MSE over environments could be used to impute trial data.

Statistical tests for each parameter of interest could follow several approaches. The first possible approach is a jackknife-based technique [15]. The second possible approach is a confidence interval test: with a large number of imputed data sets, we could construct a 95% or 99% confidence interval (CI) for each parameter and use it as a statistical test. The second approach requires a large number of imputed data sets to provide reliable CI tests for the parameters of interest, so it could be computationally intensive if the original data set is large. However, with high-power servers and/or parallel algorithms, the time needed to generate and analyze a large number of imputed data sets could be trivial.

Acknowledgements

This study was partially supported by USDA-NIFA hatch project (SD00H525-14) and South Dakota Soybean Research and Promotion Council (SD1900233).

References

[1] Kuehl, R.O. (1999) Design of Experiments: Statistical Principles of Research Design and Analysis. 2nd Edition, Duxbury Press, Pacific Grove.

[2] Gray, E. (1982) Genotype × Environment Interactions and Stability Analysis for Forage Yield of Orchardgrass Clones. Crop Science, 22, 19-23.

https://doi.org/10.2135/cropsci1982.0011183X002200010005x

[3] Enders, C.K. (2010) Applied Missing Data Analysis. The Guilford Press, New York.

[4] Eekhout, I., et al. (2012) Missing Data: A Systematic Review of How They Are Reported and Handled. Epidemiology, 23, 729-732.

https://doi.org/10.1097/EDE.0b013e3182576cdb

[5] Roth, P.L. (1994) Missing Data: A Conceptual Review for Applied Psychologists. Personnel Psychology, 47, 537-560.

https://doi.org/10.1111/j.1744-6570.1994.tb01736.x

[6] Rubin, D.B. (1987) Multiple Imputation for Nonresponse in Surveys. Wiley, New York.

https://doi.org/10.1002/9780470316696

[7] van Buuren, S. (2012) Flexible Imputation of Missing Data. Chapman & Hall/CRC, Boca Raton.

https://doi.org/10.1201/b11826

[8] De Mendiburu, F. (2017) Agricolae: Statistical Procedures for Agricultural Research.

[9] R Core Team (2012) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna.

[10] Rao, C.R. (1971) Estimation of Variance and Covariance Components MINQUE Theory. Journal of Multivariate Analysis, 1, 257-275.

https://doi.org/10.1016/0047-259X(71)90001-7

[11] Patterson, H.D. and Thompson, R. (1971) Recovery of Inter-Block Information When Block Sizes Are Unequal. Biometrika, 58, 545-554.

https://doi.org/10.1093/biomet/58.3.545

[12] Zhu, J. (1989) Estimation of Genetic Variance Components in the General Mixed Model. North Carolina State University, Raleigh.

[13] Zhu, J. (1993) Methods of Predicting Genotype Value and Heterosis for Offspring of Hybrids. Journal of Biomathematics, 8, 32-40.

[14] Wu, J. (2014) Minque: An R Package for Linear Mixed Model Analyses.

http://cran.r-project.org

[15] Bondalapati, K.D., Wu, J. and Glover, K.D. (2014) An Augmented Additive-Dominance (AD) Model for Analysis of Multi-Parental Spring Wheat F2 Hybrids. Australian Journal of Crop Science, 8, 1441-1447.