In the last two decades, there has been a push in psychological science to improve research reporting, with an emphasis on reporting effect sizes and confidence intervals (see American Educational Research Association; Cumming; Wilkinson and the Task Force on Statistical Inference). Effect sizes communicate the magnitude and direction of a practically important effect (e.g., treatment decreased depression scores by 13%), and confidence intervals communicate the precision of the effect's estimate. The importance of confidence intervals, their basic construction, and their interpretation have thus been the focus of several influential pedagogical articles (e.g., see Cumming and Fidler; Cumming and Finch; Greenland et al.).
Most, if not all, modern introductory statistics textbooks review and describe the construction of confidence intervals (e.g., see Moore et al.). Let $x_1, \ldots, x_n$ be a sample obtained from a normally distributed population with mean $\mu$ and variance $\sigma^2$. Then a $100(1-\alpha)\%$ confidence interval for $\mu$ can be calculated by
$$\bar{x} \pm t^* \frac{s}{\sqrt{n}}, \qquad (1)$$
where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean, $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ is the sample variance, and $t^*$ is the $100(1-\alpha/2)$th percentile of the Student t distribution with $n-1$ degrees of freedom. Moreover, when the sample size n is large (usually stated as n larger than 30), the confidence interval for $\mu$ can still be obtained from (1), except that $t^*$ is replaced by $z^*$, the $100(1-\alpha/2)$th percentile of the standard normal distribution.
The fundamental assumption underlying the construction of this confidence interval is that the data are normally distributed. In practice, however, collected data are usually non-normally distributed (for examples in psychology, see Cain et al.; Micceri). In public health research, Bland and Altman reported that serum triglyceride measurements are positively skewed. In biology, McDonald reported that the abundance of Eastern mudminnows in Maryland streams is non-normally distributed.
In this paper, we compare various methods for constructing confidence intervals when data are non-normally distributed. Three of the most popular and commonly used methods are the method based on the Central Limit Theorem, the bootstrap method, and the back-transformation method, which are reviewed in Section 2. The parametric Wald method and the likelihood-based third order method are also discussed in Section 2. Note that the popular back-transformation method requires the existence of a transformation such that the transformed data are normally distributed. The selection of such a transformation by the Box-Cox transformation and Tukey's ladder of power transformation is briefly discussed in Section 2. Two empirical examples are presented in Section 3 to illustrate that confidence intervals based on the different methods discussed in Section 2 can be vastly different. Simulation results are presented in Section 4 to compare the accuracy of the methods discussed in this paper. They illustrate three points. First, the likelihood-based third order method gives extremely accurate coverage probability even when the sample size is small. Second, the Wald method, the Central Limit Theorem method, and the bootstrap method all perform poorly when the sample size is small, but their performance improves as the sample size increases. Third, the popular back-transformation method should not be used because it does not construct the confidence interval for the correct parameter. Finally, some concluding remarks are given in Section 5.
This section reviews four commonly used methods for obtaining a confidence interval for the mean of a non-normal distribution, namely the Central Limit Theorem, bootstrap, back-transformation, and Wald methods. A very accurate likelihood-based method is also introduced in this section.
2.1. Central Limit Theorem Method
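Let $x_1, \ldots, x_n$ be a sample from a non-normal distribution with mean $\mu$. When the sample size n is large, the Central Limit Theorem gives
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \;\overset{d}{\longrightarrow}\; N(0, 1),$$
where $\mu = E(X)$ and $\sigma^2 = \mathrm{var}(X)$. Since $\bar{x}$ and $s^2$ are the unbiased estimates of $\mu$ and $\sigma^2$, respectively, by the Central Limit Theorem, an approximate $100(1-\alpha)\%$ confidence interval for $\mu$ is
$$\bar{x} \pm z^* \frac{s}{\sqrt{n}}, \qquad (2)$$
where $z^*$ is the $100(1-\alpha/2)$th percentile of the standard normal distribution.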
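The interval based on the Central Limit Theorem can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `clt_interval` and the example data are hypothetical, and the sketch assumes the usual form of the interval, sample mean plus or minus a standard normal percentile times the estimated standard error.

```python
import math
from statistics import NormalDist, mean, stdev


def clt_interval(sample, level=0.95):
    """Approximate CLT-based confidence interval for the population mean."""
    n = len(sample)
    # Upper percentile of the standard normal distribution (about 1.96 for 95%).
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # stdev() uses the n - 1 divisor, i.e. the unbiased sample variance.
    half_width = z_star * stdev(sample) / math.sqrt(n)
    centre = mean(sample)
    return centre - half_width, centre + half_width


# Hypothetical positively skewed data, for illustration only.
data = [2.1, 3.4, 1.9, 8.7, 4.4, 2.8, 12.5, 3.1, 5.6, 2.2]
print(clt_interval(data))
```

By construction the interval is symmetric about the sample mean, which is one reason it can behave poorly for strongly skewed data and small n.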
2.2. Bootstrap Method
The bootstrap method is a popular non-parametric method, which does not require any distributional assumptions. Efron and Tibshirani provide a detailed review of the bootstrap method. The following is an algorithmic approach to obtaining a percentile bootstrap confidence interval for the population mean, $\mu$.
Step 1: Resample the observed sample with replacement and calculate the sample mean for this bootstrap sample.
Step 2: Repeat Step 1 B times, where B is typically large (B = 5000 is used in Sections 3 and 4).
Step 3: Sort the B bootstrapped sample means; the $100(\alpha/2)$th and $100(1-\alpha/2)$th percentiles give the $100(1-\alpha)\%$ percentile bootstrap confidence interval for the population mean.
Note that, as with the Central Limit Theorem method, the bootstrap method requires the observed sample size to be large so that the sample is representative of the population.
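The three steps above can be sketched as follows. This is an illustrative implementation only: the function name, the seed, and the simple order-statistic definition of a percentile are assumptions, and the example data are hypothetical.

```python
import math
import random


def percentile_bootstrap_ci(sample, level=0.95, n_boot=5000, seed=12345):
    """Percentile bootstrap confidence interval for the population mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    boot_means = []
    for _ in range(n_boot):
        # Step 1: resample with replacement and record the bootstrap mean.
        resample = rng.choices(sample, k=n)
        boot_means.append(sum(resample) / n)
    # Step 3: sort the bootstrap means and read off the two percentiles.
    boot_means.sort()
    alpha = 1.0 - level
    lower_index = max(0, math.floor(alpha / 2 * n_boot))
    upper_index = min(n_boot - 1, math.ceil((1 - alpha / 2) * n_boot) - 1)
    return boot_means[lower_index], boot_means[upper_index]


# Hypothetical positively skewed data, for illustration only.
data = [2.1, 3.4, 1.9, 8.7, 4.4, 2.8, 12.5, 3.1, 5.6, 2.2]
print(percentile_bootstrap_ci(data))
```

Unlike the interval based on the Central Limit Theorem, the percentile interval need not be symmetric about the sample mean.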
2.3. Back-Transformation Method
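Recall that X is a non-normally distributed random variable with mean $\mu$. Assume there exists a transformation $g(\cdot)$ such that $Y = g(X)$ is normally distributed with mean $\mu_Y$ and variance $\sigma_Y^2$. By the delta method,
$$\mu_Y = E[g(X)] \approx g(\mu),$$
and an approximate $100(1-\alpha)\%$ confidence interval for $\mu$ from (1) is
$$\left( g^{-1}\!\left(\bar{y} - t^* \frac{s_y}{\sqrt{n}}\right),\; g^{-1}\!\left(\bar{y} + t^* \frac{s_y}{\sqrt{n}}\right) \right), \qquad (3)$$
where $\bar{y}$ and $s_y$ are the sample mean and sample standard deviation of the transformed data.
It is important to note that (3) could be misleading because $g^{-1}(\mu_Y)$ can be very different from $\mu$. For example, if X follows a log-normal distribution, then $Y = \log X$ is distributed as $N(\mu_Y, \sigma_Y^2)$. It follows that the delta method gives $\mu \approx \exp(\mu_Y)$. However, as shown in Table 1, $\mu = \exp(\mu_Y + \sigma_Y^2/2)$, which can be quite different from $\exp(\mu_Y)$, especially when $\sigma_Y^2$ is large. Consider another example where $Y = \sqrt{X}$ follows a $N(\mu_Y, \sigma_Y^2)$ distribution. Here, the delta method gives $\mu \approx \mu_Y^2$. However, Table 1 shows that $\mu = \mu_Y^2 + \sigma_Y^2$, which can be quite different from $\mu_Y^2$, especially when $\sigma_Y^2$ is large.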
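A small numerical sketch may make the size of this discrepancy concrete. It assumes the standard results that a variable whose logarithm is $N(\mu_Y, \sigma_Y^2)$ has mean $\exp(\mu_Y + \sigma_Y^2/2)$, and that a variable whose square root is $N(\mu_Y, \sigma_Y^2)$ has mean $\mu_Y^2 + \sigma_Y^2$. The function names and parameter values are hypothetical.

```python
import math


def mean_log_case(mu_y, var_y):
    """Mean of X when log(X) is normal with mean mu_y and variance var_y."""
    return math.exp(mu_y + var_y / 2.0)


def mean_sqrt_case(mu_y, var_y):
    """Mean of X when sqrt(X) is normal with mean mu_y and variance var_y."""
    return mu_y ** 2 + var_y


def naive_back_transform(mu_y, transform):
    """Value obtained by simply back-transforming the mean of Y."""
    return math.exp(mu_y) if transform == "log" else mu_y ** 2


# Hypothetical parameter values: the gap widens as the variance of Y grows.
for var_y in (0.1, 0.5, 1.0, 2.0):
    print(
        f"var_y={var_y}: "
        f"log case {naive_back_transform(1.0, 'log'):.3f} vs {mean_log_case(1.0, var_y):.3f}; "
        f"sqrt case {naive_back_transform(1.0, 'sqrt'):.3f} vs {mean_sqrt_case(1.0, var_y):.3f}"
    )
```

In the logarithmic case the back-transformed value is the geometric mean of X, so the ratio between the two quantities is $\exp(\sigma_Y^2/2)$ regardless of $\mu_Y$.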
The rest of this subsection provides a systematic way of choosing the transformation $g(\cdot)$. In practice, the most common simple transformations are the logarithmic transformation and the square root transformation. Box and Cox proposed a more complicated transformation, which requires the determination of the power parameter. Similarly, Tukey suggested a ladder of power transformation, which also requires the determination of the power parameter. We review Tukey's method below. With an observed sample $x_1, \ldots, x_n$, our aim is to obtain confidence intervals for $\mu$. In this paper, focus is placed on the two most commonly used transformations in practice: the logarithmic transformation and the square root transformation. Note that the transformation methods discussed can, in theory, be generalized to any known transformation (cf. Box-Cox or Tukey's transformations).
When observed data are non-normally distributed, a common approach is to first apply a transformation such that the transformed data become somewhat normally distributed. In the statistical literature, two very similar families of transformations are frequently discussed: the Box-Cox transformation and Tukey's ladder of power transformation. In particular, Osborne gives a detailed discussion of the application of the Box-Cox transformation. Mathematically, the Box-Cox transformation and Tukey's ladder of power transformation are very similar. Because Tukey's ladder of power transformation is easier to interpret than the Box-Cox transformation, we review the ladder of power transformation and suggest criteria for choosing an appropriate transformation to address non-normally distributed data below.
Table 1. Transformation and the parameter of interest.
Tukey's ladder of power transformation takes the form
$$Y = \begin{cases} X^{\lambda}, & \lambda \neq 0, \\ \log X, & \lambda = 0, \end{cases}$$
where $\lambda$ is called the power parameter of this transformation and is chosen such that Y is approximately normally distributed. Moreover, $\lambda$ should be chosen such that the power parameter is easy to interpret. Note that $\lambda = 1$ is equivalent to no transformation. In practice, the popular reciprocal transformation, logarithmic transformation, and square root transformation are equivalent to $\lambda = -1$, $\lambda = 0$, and $\lambda = 1/2$, respectively.
Table 1 presents the mean of the distribution prior to transformation, $\mu$, in terms of $\mu_Y$ and $\sigma_Y^2$, based on the type of transformation used. Since $\mu$ does not exist for the reciprocal transformation, this transformation is not considered in this paper.
With an observed sample, we suggest that the choice of $\lambda$ be based on three criteria:
1. de-trended normal quantile-quantile (Q-Q) plot,
2. p-value of the Shapiro-Wilk test of normality, and
3. skewness of the transformed data.
First, when the de-trended normal Q-Q plot deviates from the horizontal reference line, which indicates identical quantiles between the data and a theoretical normal distribution, the plot suggests that the data are likely non-normally distributed. Second, simulation studies by Razali and Wah illustrate that the Shapiro-Wilk test is the most powerful among the normality tests they compared. Under the assumption of a normal distribution, the smaller the p-value associated with the Shapiro-Wilk test, the more evidence there is against the normality assumption. Thus, the transformation which gives the largest p-value of the Shapiro-Wilk test is associated with the least evidence against the transformed data being normally distributed. Finally, with regard to skewness, the normal distribution has skewness 0. In this vein, the transformation which results in a skewness value closest to 0 is the most symmetric and would be the preferred transformation.
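The skewness criterion can be sketched with the Python standard library alone. The moment-based skewness estimator, the function names, and the example data below are assumptions made for illustration; the paper does not state which skewness estimator was used. The Q-Q plot and the Shapiro-Wilk test are not covered by this sketch.

```python
import math


def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 with divisor n."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Candidate rungs of the ladder of powers considered in this paper.
TRANSFORMS = {
    "none (lambda = 1)": lambda x: x,
    "square root (lambda = 1/2)": math.sqrt,
    "log (lambda = 0)": math.log,
}


def rank_by_skewness(sample):
    """Return (name, skewness) pairs ordered by absolute skewness, smallest first."""
    scores = {
        name: sample_skewness([transform(x) for x in sample])
        for name, transform in TRANSFORMS.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]))


# Hypothetical positive data, for illustration only.
data = [1.2, 2.0, 2.6, 3.1, 4.8, 6.5, 9.9, 15.4, 31.0]
for name, skewness in rank_by_skewness(data):
    print(f"{name}: skewness = {skewness:.3f}")
```

The first entry of the ranking is the transformation favoured by the skewness criterion alone; in practice it would be weighed together with the other two criteria.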
2.4. Wald Method
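As in the previous subsection, we assume that X is a non-normally distributed random variable with mean $\mu$ and that there exists a transformation $g(\cdot)$ such that $Y = g(X)$ is normally distributed with mean $\mu_Y$ and variance $\sigma_Y^2$. Moreover, $\theta = (\mu_Y, \sigma_Y^2)'$.
The log-likelihood function concerning Y can be written as
$$\ell(\theta) = a - \frac{n}{2}\log\sigma_Y^2 - \frac{1}{2\sigma_Y^2}\sum_{i=1}^{n}(y_i - \mu_Y)^2, \qquad (4)$$
where a is an additive constant. Without loss of generality, a is set to zero hereafter. The overall maximum likelihood estimate (MLE), denoted by $\hat{\theta} = (\hat{\mu}_Y, \hat{\sigma}_Y^2)'$, can be obtained by solving
$$\frac{\partial \ell(\theta)}{\partial \theta} = 0.$$
Hence, we have
$$\hat{\mu}_Y = \bar{y} \quad \text{and} \quad \hat{\sigma}_Y^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2.$$
The observed information matrix is the negative of the second derivatives of the log-likelihood function with respect to the parameters:
$$j_{\theta\theta'}(\theta) = -\frac{\partial^2 \ell(\theta)}{\partial\theta\,\partial\theta'} = \begin{pmatrix} \dfrac{n}{\sigma_Y^2} & \dfrac{1}{\sigma_Y^4}\sum_{i=1}^{n}(y_i - \mu_Y) \\ \dfrac{1}{\sigma_Y^4}\sum_{i=1}^{n}(y_i - \mu_Y) & -\dfrac{n}{2\sigma_Y^4} + \dfrac{1}{\sigma_Y^6}\sum_{i=1}^{n}(y_i - \mu_Y)^2 \end{pmatrix}.$$
The variance-covariance matrix for $\hat{\theta}$ can be approximated by the inverse of Fisher's expected information matrix, $E[j_{\theta\theta'}(\theta)]$, which, in general, can be difficult to obtain in practice. Nevertheless, the variance-covariance matrix for $\hat{\theta}$ can be approximated by the inverse of the observed information evaluated at the MLE, where
$$j_{\theta\theta'}^{-1}(\hat{\theta}) = \begin{pmatrix} \dfrac{\hat{\sigma}_Y^2}{n} & 0 \\ 0 & \dfrac{2\hat{\sigma}_Y^4}{n} \end{pmatrix}.$$
It is well-known that $\hat{\theta}$ is asymptotically normally distributed with mean $\theta$ and variance $j_{\theta\theta'}^{-1}(\hat{\theta})$.
Recall that the parameter of interest is $\mu$, and we denote $\psi = \psi(\theta) = \mu$. By the delta method,
$$\widehat{\mathrm{var}}(\hat{\psi}) \approx \psi_\theta(\hat{\theta})\, j_{\theta\theta'}^{-1}(\hat{\theta})\, \psi_\theta'(\hat{\theta}),$$
where $\hat{\psi} = \psi(\hat{\theta})$ and $\psi_\theta(\theta) = \partial\psi(\theta)/\partial\theta'$.
Thus, an approximate $100(1-\alpha)\%$ confidence interval for $\psi$ is
$$\hat{\psi} \pm z^* \sqrt{\widehat{\mathrm{var}}(\hat{\psi})}. \qquad (5)$$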
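For the case of the logarithmic transformation (i.e., Tukey's ladder of power transformation where $\lambda = 0$), the parameter of interest is $\psi = e^{\tau}$, where $\tau = \mu_Y + \sigma_Y^2/2$. Therefore, by the Wald method, a $100(1-\alpha)\%$ confidence interval for $\tau$ is
$$\hat{\tau} \pm z^* \sqrt{\widehat{\mathrm{var}}(\hat{\tau})},$$
where $\hat{\tau} = \hat{\mu}_Y + \hat{\sigma}_Y^2/2$ and $\widehat{\mathrm{var}}(\hat{\tau}) = \hat{\sigma}_Y^2/n + \hat{\sigma}_Y^4/(2n)$. Thus, an approximate $100(1-\alpha)\%$ confidence interval for $\psi$ is
$$\left( \exp\!\left\{\hat{\tau} - z^*\sqrt{\widehat{\mathrm{var}}(\hat{\tau})}\right\},\; \exp\!\left\{\hat{\tau} + z^*\sqrt{\widehat{\mathrm{var}}(\hat{\tau})}\right\} \right).$$
For the case of the square root transformation (i.e., Tukey's (1977) ladder of power transformation where $\lambda = 1/2$), the parameter of interest is $\psi = \mu_Y^2 + \sigma_Y^2$. Therefore, an approximate $100(1-\alpha)\%$ confidence interval for $\psi$ is given by (5), where
$$\hat{\psi} = \hat{\mu}_Y^2 + \hat{\sigma}_Y^2 \quad \text{and} \quad \widehat{\mathrm{var}}(\hat{\psi}) = \frac{4\hat{\mu}_Y^2\hat{\sigma}_Y^2}{n} + \frac{2\hat{\sigma}_Y^4}{n}.$$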
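The two Wald intervals can be sketched as below. The sketch assumes the parameterization $\theta = (\mu_Y, \sigma_Y^2)$ of the normal model for the transformed data, for which the inverse observed information at the MLE is diagonal with entries $\hat{\sigma}_Y^2/n$ and $2\hat{\sigma}_Y^4/n$; the delta-method variances in the code follow from that assumption. The function name and the example data are hypothetical.

```python
import math
from statistics import NormalDist


def wald_ci_mean(sample, transform="log", level=0.95):
    """Wald interval for the mean of X when log(X) or sqrt(X) is taken as normal."""
    if transform == "log":
        y = [math.log(x) for x in sample]
    elif transform == "sqrt":
        y = [math.sqrt(x) for x in sample]
    else:
        raise ValueError("transform must be 'log' or 'sqrt'")
    n = len(y)
    mu_hat = sum(y) / n
    var_hat = sum((v - mu_hat) ** 2 for v in y) / n  # MLE, divisor n
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    if transform == "log":
        # Interval for tau = mu_Y + var_Y / 2, then exponentiate both limits.
        tau_hat = mu_hat + var_hat / 2
        se = math.sqrt(var_hat / n + var_hat ** 2 / (2 * n))
        return math.exp(tau_hat - z_star * se), math.exp(tau_hat + z_star * se)
    # Square root case: psi = mu_Y**2 + var_Y, gradient (2 * mu_Y, 1).
    psi_hat = mu_hat ** 2 + var_hat
    se = math.sqrt(4 * mu_hat ** 2 * var_hat / n + 2 * var_hat ** 2 / n)
    return psi_hat - z_star * se, psi_hat + z_star * se


# Hypothetical positive data, for illustration only.
data = [1.2, 2.0, 2.6, 3.1, 4.8, 6.5, 9.9, 15.4, 31.0]
print(wald_ci_mean(data, "log"))
print(wald_ci_mean(data, "sqrt"))
```

Because the logarithmic interval is built on the $\tau$ scale and then exponentiated, its limits are always positive, whereas the square root interval is symmetric about $\hat{\psi}$.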
2.5. Likelihood-Based Third Order Method
Both the Central Limit Theorem method and the Wald method have a theoretical rate of convergence of $O(n^{-1/2})$, and both the back-transformation method and the bootstrap method have no known rate of convergence. In recent years, many methods have been developed to improve the rate of convergence of existing asymptotic methods. In this subsection, we review the modified signed log-likelihood ratio method by Barndorff-Nielsen. The modified signed log-likelihood ratio statistic is defined as
$$r^* = r^*(\psi) = r - \frac{1}{r}\log\frac{r}{q},$$
where
$$r = r(\psi) = \mathrm{sgn}(\hat{\psi} - \psi)\left\{2\left[\ell(\hat{\theta}) - \ell(\hat{\theta}_\psi)\right]\right\}^{1/2}$$
is the signed log-likelihood ratio statistic, $\hat{\theta}_\psi$ is the constrained MLE obtained by maximizing the log-likelihood function for a given $\psi$ value, and $q = q(\psi)$ is a statistic based on the log-likelihood function given in (4). Barndorff-Nielsen showed that $r^*$ is asymptotically distributed as a standard normal distribution with a rate of convergence of $O(n^{-3/2})$. Thus, a $100(1-\alpha)\%$ confidence interval for $\psi$ based on $r^*$ is the set of $\psi$ values satisfying $|r^*(\psi)| \le z^*$.
If the model is an exponential family model and the parameter of interest is a component of the canonical parameter, Fraser showed that $q$ is the standardized MLE statistic. With this idea in mind, for a general model, Fraser and Reid first approximate the model using an approximate tangent exponential model to obtain the locally defined canonical parameter. Then, they express the parameter of interest in terms of the locally defined canonical parameter and derive the variance of the estimated parameter of interest in this locally defined canonical parameter scale. Thus, $q$ is the standardized MLE statistic expressed in the locally defined canonical parameter scale, and the modified signed log-likelihood ratio statistic can be used to obtain a confidence interval for $\psi$. Details of this algorithmic approach to obtaining $r^*$ are outlined below.
Notation: $\ell(\theta) = \ell(\theta; y)$ is the log-likelihood function;
$\theta$ is a k-dimensional vector of parameters;
$\varphi = \varphi(\theta)$ is a k-dimensional vector of canonical parameters for the exponential family model;
$\psi = \psi(\theta)$ is a scalar parameter of interest;
$y$ is the observed data.
Aim: Inference about $\psi$.
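Step 1: From the log-likelihood function, obtain the overall MLE, $\hat{\theta}$, and the observed information evaluated at the MLE, $j_{\theta\theta'}(\hat{\theta})$.
Step 2: Apply the Lagrange multiplier technique to obtain the constrained MLE at a given value of $\psi$. More specifically, maximize
$$H(\theta, \kappa) = \ell(\theta) + \kappa\left[\psi(\theta) - \psi\right]$$
with respect to $(\theta, \kappa)$, where $\kappa$ is defined as the Lagrange multiplier. Denote the result of the maximization by $(\hat{\theta}_\psi, \hat{\kappa})$.
Step 3: Define the tilted log-likelihood function as
$$\tilde{\ell}(\theta) = \ell(\theta) + \hat{\kappa}\left[\psi(\theta) - \psi\right],$$
where $\hat{\kappa}$ is treated as a fixed value. Obtain the constrained MLE $\hat{\theta}_\psi$ either from the tilted log-likelihood function or from Step 2, and $\tilde{j}_{\theta\theta'}(\hat{\theta}_\psi)$, which is the matrix of the negative of the second derivatives of the tilted log-likelihood function evaluated at $\hat{\theta}_\psi$.
Step 4: The signed log-likelihood ratio statistic is
$$r = \mathrm{sgn}(\hat{\psi} - \psi)\left\{2\left[\ell(\hat{\theta}) - \ell(\hat{\theta}_\psi)\right]\right\}^{1/2}.$$
Step 5: Define
$$\chi(\theta) = \psi_\theta(\hat{\theta}_\psi)\,\varphi_\theta^{-1}(\hat{\theta}_\psi)\,\varphi(\theta),$$
where $\psi_\theta(\theta)$ is the first derivative of $\psi(\theta)$ with respect to $\theta$, and $\varphi_\theta(\theta)$ is the first derivative of $\varphi(\theta)$ with respect to $\theta$. This quantity is a recalibration of the parameter of interest in the canonical parameter space.
Step 6: The quantity $|\chi(\hat{\theta}) - \chi(\hat{\theta}_\psi)|$ measures the departure of $\hat{\psi}$ from $\psi$ in $\varphi$ space.
Step 7: The estimated variance for the departure in $\varphi$ space is given by
$$\widehat{\mathrm{var}}\left(\chi(\hat{\theta}) - \chi(\hat{\theta}_\psi)\right) = \frac{\psi_\theta(\hat{\theta}_\psi)\,\tilde{j}_{\theta\theta'}^{-1}(\hat{\theta}_\psi)\,\psi_\theta'(\hat{\theta}_\psi)\,\left|\tilde{j}_{\theta\theta'}(\hat{\theta}_\psi)\right|\,\left|\varphi_\theta(\hat{\theta}_\psi)\right|^{-2}}{\left|j_{\theta\theta'}(\hat{\theta})\right|\,\left|\varphi_\theta(\hat{\theta})\right|^{-2}}.$$
Step 8: The standardized MLE departure under the $\varphi$ scale is given by
$$q = \mathrm{sgn}(\hat{\psi} - \psi)\,\frac{\left|\chi(\hat{\theta}) - \chi(\hat{\theta}_\psi)\right|}{\sqrt{\widehat{\mathrm{var}}\left(\chi(\hat{\theta}) - \chi(\hat{\theta}_\psi)\right)}}.$$
Step 9: The modified signed log-likelihood ratio statistic is given by
$$r^* = r - \frac{1}{r}\log\frac{r}{q}.$$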
Although the algorithm involves many steps, it can easily be implemented in algebraic or statistical software such as MATLAB, Maple and R.
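As a partial illustration, the signed log-likelihood ratio statistic of Step 4 can be sketched for the logarithmic case. This covers the first-order statistic $r$ only, not $q$ or $r^*$. The sketch assumes the normal log-likelihood for the log-transformed data and works on the scale $\tau = \mu_Y + \sigma_Y^2/2$ (the logarithm of the mean of X); $r$ is unchanged by this one-to-one relabeling of the parameter of interest. Under these assumptions the constrained MLE of the variance has the closed form $2\{\sqrt{1 + \hat{\sigma}_Y^2 + (\bar{y} - \tau)^2} - 1\}$, which is derived for this sketch and does not appear in the paper. All function names and data are hypothetical.

```python
import math


def normal_loglik(y, mu, var):
    """Normal log-likelihood with the additive constant dropped."""
    n = len(y)
    return -0.5 * n * math.log(var) - sum((v - mu) ** 2 for v in y) / (2 * var)


def signed_lr_log_mean(y, tau):
    """Signed log-likelihood ratio r(tau), where tau = mu_Y + var_Y / 2.

    `y` holds the log-transformed observations.
    """
    n = len(y)
    mu_hat = sum(y) / n
    var_hat = sum((v - mu_hat) ** 2 for v in y) / n
    tau_hat = mu_hat + var_hat / 2
    # Constrained MLE: maximize the log-likelihood subject to mu + var / 2 = tau.
    var_con = 2 * (math.sqrt(1 + var_hat + (mu_hat - tau) ** 2) - 1)
    mu_con = tau - var_con / 2
    gap = normal_loglik(y, mu_hat, var_hat) - normal_loglik(y, mu_con, var_con)
    r = math.sqrt(max(2 * gap, 0.0))  # guard against tiny negative rounding
    return math.copysign(r, tau_hat - tau)


# Hypothetical log-scale data, for illustration only.
log_data = [math.log(x) for x in (1.2, 2.0, 2.6, 3.1, 4.8, 6.5, 9.9, 15.4, 31.0)]
for tau in (1.0, 1.5, 2.0, 2.5, 3.0):
    print(f"tau = {tau}: r = {signed_lr_log_mean(log_data, tau):.3f}")
```

The statistic decreases in $\tau$ and crosses zero at $\hat{\tau}$; the third order method refines it through the correction term in Step 9.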
3. Empirical Examples
In this section, the different methods of constructing a confidence interval about the mean of non-normally distributed data are illustrated with two empirical examples. We demonstrate that the results obtained by the methods discussed in this paper can be very different.
3.1. Example 1: Serum Triglyceride Measurements
Bland and Altman considered n = 278 serum triglyceride measurements, which had a positively skewed distribution with an average of 0.51 mmol/l and a standard deviation of 0.22 mmol/l. By applying a base 10 logarithm transformation to the data to obtain a less skewed distribution, the transformed distribution became bell-shaped with an average of −0.33 and a standard deviation of 0.17. By applying the Central Limit Theorem, they report a 95% confidence interval for the mean serum triglyceride measurements to be (0.48, 0.54). Using the back-transformation method, the corresponding interval is (0.45, 0.49). Table 2 presents the 95% confidence intervals for the mean serum triglyceride measurements for the alternative methods reviewed above. Note that for this example, the bootstrap method cannot be applied because the original data set is unavailable.
Table 2. 95% confidence interval for the mean serum triglyceride measurements.
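The two intervals quoted for this example can be checked from the reported summary statistics alone. The sketch below uses the standard normal percentile in both cases, which is an assumption; with n = 278 the Student t percentile gives the same limits to two decimal places. Variable names are hypothetical.

```python
import math
from statistics import NormalDist

n = 278
z_star = NormalDist().inv_cdf(0.975)

# Summary statistics on the original scale (mmol/l).
raw_mean, raw_sd = 0.51, 0.22
# Summary statistics after the base 10 logarithm transformation.
log_mean, log_sd = -0.33, 0.17

# Central Limit Theorem interval on the original scale.
raw_half_width = z_star * raw_sd / math.sqrt(n)
clt_interval = (raw_mean - raw_half_width, raw_mean + raw_half_width)

# Back-transformation: build the interval on the log10 scale, then invert.
log_half_width = z_star * log_sd / math.sqrt(n)
back_interval = (10 ** (log_mean - log_half_width), 10 ** (log_mean + log_half_width))

print([round(v, 2) for v in clt_interval])   # [0.48, 0.54]
print([round(v, 2) for v in back_interval])  # [0.45, 0.49]
print(round(10 ** log_mean, 2))              # geometric mean, 0.47
```

The back-transformed interval is centred near the geometric mean (about 0.47) rather than the arithmetic mean of 0.51, and it does not even contain 0.51.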
Bland and Altman noted that the interval obtained by the back-transformation method is actually the 95% confidence interval for the geometric mean of serum triglyceride measurements instead of the mean serum triglyceride measurements, where the latter is the parameter of interest. Stated differently, the back-transformation method does not provide information about the focal parameter of interest (i.e., the mean of the non-normal distribution). From Table 2, it can be observed that the results from the Central Limit Theorem method are different from those obtained by the Wald method and the third order method. Additionally, the Wald method and the third order method give results which agree up to the second decimal place. This observation is not surprising because these two methods theoretically converge to the same answer when the sample size goes to infinity. The only difference is that the third order method has a faster rate of convergence than the Wald method (i.e., $O(n^{-3/2})$ versus $O(n^{-1/2})$, respectively). The different rates of convergence are more formally illustrated in Section 4.
3.2. Example 2: Abundance of Eastern Mudminnows
McDonald reported data on the abundance of Eastern mudminnows in Maryland streams, which are reproduced below:
These data are non-normally distributed, and McDonald suggested that both the logarithmic and square root transformed data are suitable for analysis because they are more normally distributed than the original data and data from other competing transformations. His final analysis makes use of the logarithmic transformed data.
Table 3 presents the 95% confidence intervals for the mean of the non-transformed distribution obtained by applying the Central Limit Theorem method and the bootstrap method with B = 5000 to the original data; and the back-transformation method, Wald method, and likelihood-based third order method to both the logarithmic transformed data and square root transformed data.
Table 3. 95% confidence intervals for the mean of the abundance of Eastern mudminnows in Maryland streams.
The results obtained by the methods discussed in this paper are very different for different transformations. In particular, the logarithmic transformation results in a much larger upper bound of the interval compared to the square root transformation. Thus, it is essential to identify which transformation is more appropriate for a given set of data.
The de-trended normal Q-Q plots for the original data, logarithmic transformed data and square root transformed data are shown in Figure 1. From these plots, it is obvious that the original data are not normally distributed because the points deviate from the horizontal reference line, which indicates identical quantiles between the data and a theoretical normal distribution. The two sets of transformed data are more closely normally distributed because the points in the de-trended normal Q-Q plots lie closer to the reference line relative to the original data.
Figure 1. De-trended Normal Q-Q plots for original and transformed data of the abundance of Eastern mudminnows.
The Shapiro-Wilk test of normality on the original data gives a p-value of 0.1091. The same test gives a p-value of 0.5261 on the logarithmic transformed data and a p-value of 0.6479 on the square root transformed data. Consistent with the de-trended Q-Q plot, the p-values of the Shapiro-Wilk test similarly suggest that the two transformed data sets are more likely to be normally distributed. Additionally, the empirical skewness of the original data, logarithmic transformed data, and square root transformed data are 0.5864, −0.4886, and 0.1632, respectively. These quantifications of skewness imply that the square root transformed data are more symmetrical than the original data and logarithmic transformed data. Thus, based on the criteria discussed in Section 2.3, the square root transformation is recommended for these data.
4. Simulation Study
A simulation study was carried out to compare the accuracies of the methods discussed in this paper. R code for the simulation is available to the interested reader upon request. For each combination of sample size and parameter values, we generated 10,000 samples from $N(\mu_Y, \sigma_Y^2)$. These are our simulated transformed samples, and the non-transformed (i.e., original) samples can be obtained by applying the inverse transformation to the simulated data. The transformations examined are the natural logarithm and square root. For each simulated sample, we computed a 95% confidence interval for the mean of the untransformed population from the five reviewed methods: Central Limit Theorem, bootstrap (B = 5000), back-transformation, Wald, and likelihood-based third order. The following quantities are recorded: the proportion of true means falling within the 95% confidence interval (coverage proportion), the proportion of true means less than the lower 95% confidence limit (lower error), and the proportion of true means greater than the upper 95% confidence limit (upper error). The nominal values of coverage, lower error, and upper error are 0.95, 0.025, and 0.025, respectively. We present only a small subset of the simulations we conducted to highlight several key points below, and other simulation results are available upon request.
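The design of such a coverage study can be sketched for two of the five methods in the logarithmic case. This is not the authors' R code: the function name, seed, number of replications, and parameter values are hypothetical, and the standard normal percentile is used for both intervals as a simplification.

```python
import math
import random
from statistics import NormalDist


def coverage_log_case(n, mu_y, sigma_y, reps=2000, seed=1):
    """Estimate coverage of the CLT and back-transformation intervals.

    Transformed samples are drawn from N(mu_y, sigma_y**2); the original
    samples are obtained by exponentiating them.
    """
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(0.975)
    true_mean = math.exp(mu_y + sigma_y ** 2 / 2)  # mean of the untransformed population
    hits = {"clt": 0, "back_transformation": 0}
    for _ in range(reps):
        y = [rng.gauss(mu_y, sigma_y) for _ in range(n)]
        x = [math.exp(v) for v in y]
        for name, values, inverse in (
            ("clt", x, lambda v: v),
            ("back_transformation", y, math.exp),
        ):
            centre = sum(values) / n
            sd = math.sqrt(sum((v - centre) ** 2 for v in values) / (n - 1))
            half_width = z_star * sd / math.sqrt(n)
            lower, upper = inverse(centre - half_width), inverse(centre + half_width)
            hits[name] += lower <= true_mean <= upper
    return {name: count / reps for name, count in hits.items()}


print(coverage_log_case(n=30, mu_y=0.0, sigma_y=1.0))
```

With these hypothetical settings the back-transformation interval covers the true mean far less often than the nominal 95%, because it is centred on $\exp(\mu_Y)$ rather than on the mean of the untransformed population.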
Table 4 presents results with the natural logarithmic transformed data being generated from a normal distribution, where the parameter of interest is $\psi = \exp(\mu_Y + \sigma_Y^2/2)$.
Table 4. 95% coverage probability for the logarithmic transformation case.
It can be observed that the likelihood-based third order method outperforms the other methods, especially when the sample size is small; coverage, lower and upper errors associated with the likelihood-based third order method are relatively closer to nominal rates compared to the alternative methods. Among the remaining methods, the Central Limit Theorem method and the bootstrap method give similar results. The Wald method seems to converge faster than the Central Limit Theorem and bootstrap methods. As discussed in Section 2, the back-transformation method gives unacceptable coverage probability because it is constructing confidence intervals about a parameter that is not of interest.
Table 5 presents results with the square root transformed data being generated from a normal distribution, where the parameter of interest is $\psi = \mu_Y^2 + \sigma_Y^2$.
Similar to results in Table 4, we can observe that the likelihood-based third order method outperforms the other methods, especially when sample size is small. In this context, the Central Limit Theorem method and the bootstrap method give similar results and they seem to converge faster than the Wald method. The back-transformation method continues to give unacceptable coverage probability because it constructs confidence intervals about a parameter that is not of interest.
Based on these simulation results, the Central Limit Theorem method, bootstrap method and Wald method converge slowly relative to the likelihood-based third order method. Hence, we recommend using the likelihood-based third order method to obtain confidence intervals for the mean of the non-transformed distribution after applying a normalizing transformation to non-normal data, especially for small sample sizes or large departures from normality. It is important to note that researchers should not use the popular back-transformation method despite its simplicity, except for the special case where the back-transformed mean of the transformed data equals the mean of the original distribution (i.e., $g^{-1}(\mu_Y) = \mu$).
More simulations have been performed with the same pattern of results. They are not reported here, but are available upon request.
Table 5. 95% coverage probability for the square root transformation case.
When interest is in constructing a confidence interval for the mean of a non-normal distribution, normalizing transformations are typically recommended as a first step. This paper recommends the use of de-trended normal Q-Q plots, the largest p-value of the Shapiro-Wilk test, and quantifications of skewness on the transformed data to determine the power parameter ($\lambda$) for Tukey's ladder of power transformation when the exact transformation is unavailable. Our results strongly advise against using the popular back-transformation approach in applied work because it does not construct confidence intervals about the parameter of interest (i.e., the mean of the original distribution). Instead, we recommend the likelihood-based third order method because of its superior performance in terms of its rate of convergence, coverage, and accuracy relative to the Central Limit Theorem, bootstrap and Wald methods, even when the sample size is small or the distribution is far from being normal.
We thank the editor and the referee for their comments. This work was based on O.C.Y. Wong’s undergraduate honors thesis. J. Pek was supported by the Natural Sciences and Engineering Research Council of Canada Discovery Grant (RGPIN-04301-2014) and the Early Researcher Award by the Ontario Ministry of Research and Innovation (ER15-11-004). A.C.M. Wong was supported by the Natural Sciences and Engineering Research Council of Canada Discovery Grant (RGPIN-163597-2012).
American Educational Research Association (2006) Standards for Reporting on Empirical Social Science Research in AERA Publications. Educational Researcher, 35, 33-40.
Wilkinson, L. and the Task Force on Statistical Inference (1999) Statistical Methods in Psychology Journals: Guidelines and Explanations. American Psychologist, 54, 594-604.
Cumming, G. and Finch, S. (2001) A Primer on the Understanding, Use, and Calculation of Confidence Intervals that Are Based on Central and Noncentral Distributions. Educational and Psychological Measurement, 61, 532-574.
Greenland, S., Senn, S.J., Rothman, K.J., Carlin, J.B., Poole, C., Goodman, S.N. and Altman, D.G. (2016) Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations. European Journal of Epidemiology, 31, 337-350.
Cain, M.K., Zhang, Z. and Yuan, K.H. (2016) Univariate and Multivariate Skewness and Kurtosis for Measuring Nonnormality: Prevalence, Influence and Estimation. Behavior Research Methods, 1-20.