Ecological studies often involve data collected over time. Examples include observations of population densities, such as the relative abundance of the white sea urchin (Lytechinus anamesus) in an area offshore of the San Onofre Nuclear Generating Station (Schroeter et al. ), or of the difference in chlorophyll concentrations between two lakes (Carpenter et al. ). Such data need not satisfy the standard assumption of independent observations and can in fact be autocorrelated. A long-memory time series has autocorrelation that decays at a slow, hyperbolic rate. Long-memory models have been shown to be effective for natural processes such as the yearly minimal water levels of the Nile River and the monthly temperatures for the northern hemisphere (Beran ).
Researchers have explored the relationships among long memory, aggregation, and structural breaks in time series . In , the authors show that the number of spurious breaks detected in a long-memory series approaches infinity as the sample size does, while in  and  the authors show that structural breaks in a time series can create spurious long memory. In , the authors propose a test to detect spurious long memory using aggregation tests. Unfortunately, these aggregation tests depend upon very long time series, which are rare in ecological experiments, making undetected long memory a real danger.
Tests that seek to detect breaks due to an intervention in a series whose true data-generating process is long memory are in danger of detecting spurious breaks. Later in this paper we examine two examples from the literature, chosen because a significant intervention effect was detected when in fact no intervention occurred. A possible explanation is the presence of strong correlation in the data, perhaps long memory, which could have produced the spurious detection of a significant intervention effect. This possibility provided the impetus for this manuscript.
The Before-After-Control-Impact (BACI) design  uses two ecological units, one as a control and the other as an impact, i.e., the impact unit has an intervention applied to it. Repeated measurements are taken on each of the units, before and after the intervention. The paired-in-time differences between impact and control units are the object of statistical analysis. The original BACI analysis uses a 2-sample t-test to compare the pre-intervention mean paired difference with the post-intervention mean paired difference. An alternative to the 2-sample t-test is Randomized Intervention Analysis (RIA) , which uses a permutation test to conduct the comparison. BACI and RIA both assume that the pre-intervention differences and the post-intervention differences from the repeated measurements on the control and impact units form two independent random samples.
Rather than ignore the autocorrelation, strategies have been proposed to adjust the BACI and RIA analyses for autocorrelated data. One approach is parametric: estimate the correlation structure using an assumed (short-memory) model and use the estimated correlation to adjust the 2-sample t-test and confidence interval in the BACI analysis (see Bence  for a survey). A second option is to adopt a nonparametric approach, whereby the original data are block resampled, the blocks being groups of observations chosen large enough to take the correlation into account. RIA is essentially block resampling using blocks of size one, which is accurate only when the observations are independent. Neither approach is entirely satisfactory when the data are collected from a long-memory process, leaving the experimenter vulnerable to spurious break detection.
This paper is structured as follows: Section 2 contains definitions and some simple derivations. In Section 3 we conduct numerical studies to illustrate the inadequacy of short-memory corrections for long-memory series. Section 4 contains two examples from the literature, and Section 5 concludes the paper with a brief discussion.
2. Definitions and Derivations
2.1. Time Series
The following facts from time series theory and methodology may be found in standard texts such as . The autocorrelation structure of a stationary time series $\{X_t\}$ is described by its autocorrelation function (ACF), denoted $\rho(h)$, where $h$ denotes the time lag between the two random variables:
$$\rho(h) = \operatorname{Corr}(X_t, X_{t+h}).$$
A short-memory time series has an ACF that decays at an exponential rate, i.e., $\rho(h)/a^{h}$ approaches a positive constant as $h \to \infty$ for some $0 < a < 1$. A long-memory time series has an ACF that decays at a hyperbolic rate: $\rho(h)/h^{2d-1}$ approaches a positive constant as $h \to \infty$ for some $0 < d < 1/2$.
Suppose $\{\epsilon_t\}$ is a white noise process. We consider the following short-memory process, the autoregressive model of order 1 (AR(1)):
$$X_t = \phi X_{t-1} + \epsilon_t, \qquad |\phi| < 1.$$
The ACF of the AR(1) is given by
$$\rho(h) = \phi^{|h|}.$$
Long-memory processes, as described by fractionally differenced white noise (FD(d)), define the time series $\{X_t\}$ by
$$(1 - B)^d X_t = \epsilon_t, \qquad 0 < d < 1/2,$$
where $B$, defined by $B X_t = X_{t-1}$, is the backshift operator and the fractional differencing operator $(1 - B)^d$ has the polynomial expansion
$$(1 - B)^d = \sum_{k=0}^{\infty} \binom{d}{k} (-B)^k.$$
The exact form of the ACF of the FD(d) process is known (see , pgs. 63ff.):
$$\rho(h) = \frac{\Gamma(h + d)\,\Gamma(1 - d)}{\Gamma(h - d + 1)\,\Gamma(d)}.$$
The AR(1) and FD(d) are both stationary processes. Fractionally differenced white noise is a classic long-memory time series, having an ACF that decays at a hyperbolic rate: $\rho(h)/h^{2d-1}$ approaches a positive constant as $h \to \infty$. The AR(1) model is short-memory since its ACF converges to zero at an exponential rate as $h \to \infty$.
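The contrast between the two decay rates can be checked numerically. Below is a rough pure-Python sketch (the function names are ours, not from the paper) using the standard closed-form ACFs: $\phi^{|h|}$ for the AR(1) and the gamma-function formula for FD(d), evaluated on the log scale to avoid overflow at large lags.

```python
import math

def ar1_acf(phi, h):
    """ACF of a stationary AR(1): rho(h) = phi^|h| (exponential decay)."""
    return phi ** abs(h)

def fd_acf(d, h):
    """Exact ACF of FD(d) noise:
    rho(h) = Gamma(h+d)Gamma(1-d) / (Gamma(h-d+1)Gamma(d)),
    computed via lgamma to avoid overflow at large lags."""
    h = abs(h)
    if h == 0:
        return 1.0
    return math.exp(math.lgamma(h + d) - math.lgamma(h - d + 1)
                    + math.lgamma(1 - d) - math.lgamma(d))

# At lag 100 the AR(1) correlation is negligible, while the FD(d)
# correlation remains noticeably positive.
print(ar1_acf(0.7, 100))   # essentially zero
print(fd_acf(0.3, 100))    # still of order 10^-2
```

At lag 1 the FD(d) formula reduces to the well-known value $d/(1-d)$, a useful sanity check on any implementation.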
2.2. BACI Analysis
Consider the BACI design. Suppose $I_1, \ldots, I_n$ are observations on the impact site, $C_1, \ldots, C_n$ are observations on the control site, and $D_1, \ldots, D_n$ are the differences between the two:
$$D_t = I_t - C_t, \qquad t = 1, \ldots, n.$$
Assume $\{I_t\}$ and $\{C_t\}$ are jointly stationary, yielding a stationary $\{D_t\}$, and that $\operatorname{Var}(D_t) = \sigma^2$. As is well known, if $D_1, \ldots, D_n$ form a random sample then $\operatorname{Var}(\bar{D}) = \sigma^2/n$. However, when $D_1, \ldots, D_n$ are realizations from a stationary time series with autocorrelation function $\rho(h)$, then
$$\operatorname{Var}(\bar{D}) = \frac{\sigma^2}{n}\,c_n, \qquad c_n = 1 + 2 \sum_{h=1}^{n-1} \left(1 - \frac{h}{n}\right) \rho(h).$$
The factor $c_n$ is sometimes called the variance correction factor .
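In code, the variance correction factor $c_n = 1 + 2\sum_{h=1}^{n-1}(1 - h/n)\rho(h)$ is a direct sum over the ACF. A minimal sketch (the function name is ours), treating the ACF as a supplied function:

```python
def variance_correction(rho, n):
    """Variance correction factor for the mean of n stationary observations:
    c_n = 1 + 2 * sum_{h=1}^{n-1} (1 - h/n) * rho(h),
    so that Var(mean) = (sigma^2 / n) * c_n."""
    return 1.0 + 2.0 * sum((1.0 - h / n) * rho(h) for h in range(1, n))

# Independent observations (rho = 0) need no correction; positive AR(1)
# correlation inflates the variance toward the limit (1 + phi) / (1 - phi).
print(variance_correction(lambda h: 0.0, 100))      # 1.0
print(variance_correction(lambda h: 0.5 ** h, 100))
```

For an AR(1) ACF with $\phi = 0.5$ the factor approaches $(1+\phi)/(1-\phi) = 3$ as $n$ grows, while for a long-memory ACF the factor keeps growing with $n$, which is the phenomenon examined in Section 3.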
The estimated correction factor
$$\hat{c}_n = 1 + 2 \sum_{h=1}^{n-1} \left(1 - \frac{h}{n}\right) \hat{\rho}(h)$$
is used to adjust the usual estimate of $\operatorname{Var}(\bar{D})$, $\hat{\sigma}^2/n$, when $D_1, \ldots, D_n$ are realizations of a stationary time series. Bence  made an extensive investigation of the effect of the estimated correction factor when the autocorrelation is assumed to be that of an AR(1) model and is estimated by $\hat{\rho}(h) = \hat{\phi}^{h}$, where $\hat{\phi}$ is an estimate of the autoregression coefficient.
The BACI design uses a 2-sample t-test to compare the pre-intervention and post-intervention control-impact mean differences. Let $D_1, \ldots, D_{n_1}$ denote the pre-intervention differences, $D_{n_1+1}, \ldots, D_{n_1+n_2}$ the post-intervention differences, $\bar{D}_1$ and $\bar{D}_2$ their respective means, and $\widehat{SE}$ the estimated standard error of $\bar{D}_1 - \bar{D}_2$, where
$$\widehat{SE} = \sqrt{\hat{c}\,\hat{\sigma}^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}.$$
The estimated standard error is calculated using (2) and the estimated variance correction (4). The estimates $\hat{\sigma}^2$ and $\hat{c}$ are obtained by pooling the two sets of differences. Note that $\widehat{SE}$ ignores the correlation between the two samples, since
$$\operatorname{Var}(\bar{D}_1 - \bar{D}_2) = \operatorname{Var}(\bar{D}_1) + \operatorname{Var}(\bar{D}_2) - 2\,\operatorname{Cov}(\bar{D}_1, \bar{D}_2).$$
This is another source of inaccuracy for the method in the presence of long memory; for a short-memory process the problem will not be as severe.
The assumption that $\{I_t\}$ and $\{C_t\}$ are jointly stationary allows the use of the 2-sample t-test with equal variances. Combined with the null hypothesis of no intervention effect, this suggests the following approximate test statistic for the 2-sample t-test:
$$t = \frac{\bar{D}_1 - \bar{D}_2}{\widehat{SE}}.$$
The use of the t-distribution depends on asymptotic theory which requires very large samples when the process is long-memory. For smaller samples it is not exactly correct, but it is difficult to work out the exact distribution (  , Section 8.6.3).
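The corrected statistic can be sketched as follows. This is a minimal illustration under our reading of the BACI correction, with the pooled variance inflated by the supplied correction factor; the function name and argument layout are ours, not the paper's.

```python
import math

def corrected_t(pre, post, c_hat):
    """2-sample t statistic with the pooled variance inflated by the
    estimated variance correction factor c_hat (c_hat = 1 recovers the
    ordinary equal-variance 2-sample t statistic)."""
    n1, n2 = len(pre), len(post)
    m1 = sum(pre) / n1
    m2 = sum(post) / n2
    ss1 = sum((x - m1) ** 2 for x in pre)
    ss2 = sum((x - m2) ** 2 for x in post)
    s2_pooled = (ss1 + ss2) / (n1 + n2 - 2)      # pooled variance estimate
    se = math.sqrt(c_hat * s2_pooled * (1.0 / n1 + 1.0 / n2))
    return (m1 - m2) / se
```

Inflating the variance by $\hat{c}$ shrinks the statistic by $1/\sqrt{\hat{c}}$, which is how the correction guards against spurious rejections when autocorrelation is positive.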
2.3. RIA and Permutation Tests
An alternative to using a correction factor for the standard error of the mean is to use nonparametric methods. The procedure is to resample blocks (the block bootstrap), the blocks being chosen large enough to properly capture the autocorrelation. The permutation test used in RIA is essentially block resampling with blocks of size one. Block resampling can be effective when the correlation structure is that of a short-memory process, since only small blocks are required. However, when long memory is present the blocks must be large, requiring very large samples.
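A moving-block bootstrap of the mean can be sketched as follows (a pure-Python illustration; the names are ours). Overlapping blocks long enough to span the correlation are drawn with replacement and concatenated back to the original series length:

```python
import random

def block_bootstrap_means(x, block_len, n_boot, rng):
    """Moving-block bootstrap: draw overlapping blocks of length block_len
    with replacement, concatenate to the original series length, and return
    the bootstrap replicates of the sample mean."""
    n = len(x)
    n_starts = n - block_len + 1          # number of overlapping blocks
    n_blocks = -(-n // block_len)         # ceiling division
    means = []
    for _ in range(n_boot):
        resampled = []
        for _ in range(n_blocks):
            s = rng.randrange(n_starts)
            resampled.extend(x[s:s + block_len])
        means.append(sum(resampled[:n]) / n)
    return means
```

With `block_len = 1` this reduces to i.i.d. resampling, the situation implicit in RIA; for long-memory data the block length must grow rapidly with the sample size for the resampling to capture the persistence.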
The problems that correlated data pose for RIA have been studied previously. One examination is in Carpenter et al. . The authors simulated data from short-memory AR(1) and MA(1) processes and analyzed them with a permutation test. Recognizing that RIA is affected by autocorrelation, the authors recommended a correction to the p-value when dealing with positive autocorrelations, for example, using a declared p-value of 0.01 to obtain a true p-value of 0.05. However, the autocorrelations used in this study were short-memory and at most moderate in strength. The simulation results in Section 3 suggest such a correction is not adequate for data with long memory, as in a FD(d) process.
3. Numerical Studies

All simulations were run using the R environment . The R package fracdiff  calculates the maximum likelihood estimates of the parameters of a FD(d) model, following Haslett and Raftery . There is a large body of literature concerned with estimating the long-memory parameter d, but that is not the focus of this paper. The package also contains a routine that simulates observations from the process. In all simulations we assumed the white noise process followed a standard normal distribution.
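For readers without R, a rough pure-Python stand-in for the fracdiff simulator uses the truncated MA($\infty$) representation of FD(d), $X_t = \sum_{k \ge 0} \psi_k \epsilon_{t-k}$ with $\psi_k = \Gamma(k+d)/(\Gamma(k+1)\Gamma(d))$. The function names and the truncation point are our own choices, and truncation biases results for d near 1/2:

```python
import random

def fd_psi(d, trunc):
    """MA(infinity) weights of FD(d): psi_0 = 1 and
    psi_k = psi_{k-1} * (k - 1 + d) / k,
    equivalent to psi_k = Gamma(k+d) / (Gamma(k+1) * Gamma(d))."""
    psi = [1.0]
    for k in range(1, trunc):
        psi.append(psi[-1] * (k - 1 + d) / k)
    return psi

def fd_sim(n, d, rng, trunc=500):
    """Approximate FD(d) series of length n: X_t = sum_{k<trunc} psi_k * eps_{t-k},
    with standard normal white noise, truncating the infinite sum at trunc terms."""
    psi = fd_psi(d, trunc)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n + trunc)]
    return [sum(psi[k] * eps[t + trunc - k] for k in range(trunc))
            for t in range(n)]
```

Setting d = 0 makes every weight beyond $\psi_0$ vanish, so the routine returns plain white noise, matching the FD(0) case used as the null in Section 3.2.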
3.1. Variance Correction Factors
The correction factors for FD(d) and AR(1) processes can be computed from (1) and (3) when the values of $\rho(h)$ are known. Table 1 below contains the
Table 1. Variance corrections of autoregressive (AR) and fractionally differenced (FD) models, rounded to 2 decimal places. Parameter values are in parentheses.
correction factors for the FD(d) process and those of the AR(1) process for the parameter values indicated in the table. Several sample sizes were used.
The most striking difference between the AR(1) and the FD(d) correction factors is the rate at which they increase with the sample size. The FD(d) corrections increase faster because the autocorrelations persist longer. For small sample sizes the AR(1) corrections tend to be equal to or slightly larger than the FD(d) corrections, while for large sample sizes they are too small.
Bence  observes that setting the AR(1) coefficient to 0.99 or 0.9999 makes little difference, because in either case the confidence bounds will be so broad that little could be claimed on the basis of the estimate. AR(0.99) is competitive for small samples, but for moderate to large samples it underestimates the variance correction factor; even setting the coefficient this close to 1 is not conservative enough for data with very strong long memory, indicating how serious the situation is. Also, data simulated from a strongly long-memory process have a reasonable probability of returning a non-stationary fitted AR(1) model, particularly for smaller samples, rendering the short-memory correction unusable.
3.2. Size of 2-Sample t Hypothesis Tests with and without AR(1) Variance Corrections
To investigate the size of 2-sample t-tests when the data are from a long-memory process, series of various lengths were simulated for several values of d. Each simulated series was split into two equal halves to form the two samples. The case $d = 0$ corresponds to white noise for the errors. The t-test statistic was calculated both with and without the AR(1) variance correction, and the null was rejected if the test statistic exceeded the appropriate critical value. The proportion of rejections was the estimated size of the test. The AR(1) correction used the value of $\hat{\phi}$ estimated from the simulated series. Results are in Table 2.
For white noise ($d = 0$) processes the uncorrected and AR(1)-corrected tests have size approximately equal to the nominal size. As d increases, the sizes of the tests increase. Though the AR(1) correction performs better than no correction, its performance is very poor for strong long memory. Also, in the presence of long memory the size of the test drifts away from the nominal size as the sample size increases, the performance being worse the stronger the long memory.
3.3. Randomized Intervention Analysis
Carpenter et al. appear to have introduced RIA in . The RIA permutation test examines all possible permutations of the observed pre-intervention and post-intervention differences, determines from these the distribution of the absolute value of the difference between the pre-intervention and post-intervention means, and uses this distribution to compute a p-value for the observed data. The case where there is no intervention effect is equivalent to splitting a single series of differences into two series at the time of the ineffective intervention, and then comparing these two series with a permutation
Table 2. Hypothesis test size Monte Carlo approximations for 10,000 simulated FD(d) series for the indicated values of d and n. “None” denotes the ordinary 2-sample t-test, “AR” denotes the 2-sample t-test with the AR(1) correction.
test. The permutation test assumes the differences are independent, an assumption violated by data possessing long-memory.
Computing the exact p-value for a permutation test can be computationally taxing even for moderate sample sizes. The p-value can instead be approximated via Monte Carlo methods, using random assignments of the data to the two samples. The estimated p-value is the proportion of random assignments resulting in an absolute mean difference that meets or exceeds the observed difference. Since the aim of the simulation is to approximate the distribution of the p-value returned by RIA applied to a FD(d) time series, Monte Carlo methods are again applied to simulate many realizations from a FD(d) process, and an approximate p-value is calculated for each.
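The Monte Carlo approximation just described can be sketched as follows (function and variable names are ours):

```python
import random

def ria_pvalue(pre, post, n_perm, rng):
    """Monte Carlo RIA p-value: the proportion of random reassignments of the
    pooled differences whose absolute mean difference meets or exceeds the
    observed absolute mean difference."""
    pooled = list(pre) + list(post)
    n1 = len(pre)
    observed = abs(sum(pre) / n1 - sum(post) / len(post))
    hits = 0
    for _ in range(n_perm):
        perm = rng.sample(pooled, len(pooled))   # one random reassignment
        a, b = perm[:n1], perm[n1:]
        if abs(sum(a) / n1 - sum(b) / len(b)) >= observed:
            hits += 1
    return hits / n_perm
```

Note that nothing in the procedure uses the time ordering of the differences, which is precisely why autocorrelated data violate its assumptions.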
Carpenter et al. recognized that RIA is affected by autocorrelation. They simulated data from short-memory AR(1) and MA(1) processes and ran these through RIA, also checking for true rejections when a given intervention of size ms occurred, that is, sizes that are multiples m of the standard deviation s. As a result they recommend a correction to the p-value when dealing with positive autocorrelations, i.e., using a declared p-value of 0.01 to obtain a true p-value of 0.05.
Table 3. Quartile summary of Monte Carlo distribution results of estimated RIA p-values for 10,000 random permutations of each of 1000 simulated FD(d) series with the indicated values of d and n.
In the simulation a permutation test was applied to each of 1000 simulated long-memory FD(d) series, for the values of d and n indicated. The estimated permutation-test p-values were based on 10,000 random permutations of each simulated data set. Note the simulated long-memory series contain no intervention but, as mentioned, strongly violate the assumption of independent observations behind the permutation test. Estimated quartiles of the p-value distributions are summarized below in Table 3.
For fixed d, as the sample size n increases, the distribution becomes increasingly right-skewed, with the p-values increasingly concentrated near zero. This is also true for fixed n, as the long-memory parameter d increases. The simulation results indicate long-memory data analyzed with a permutation test will result in many false detections of trend or intervention.
4. Data Examples
As mentioned, the R package fracdiff  calculates the maximum likelihood parameters of a FD(d) model, following Haslett and Raftery . The log-likelihood from fitting an FD(d) model to data is optional output and can be used to test for long memory against the null hypothesis of white noise. Asymptotically, −2 times the log-likelihood ratio follows a chi-square distribution under the null hypothesis. However, fearing slow convergence due to long memory (for long-memory time series, convergence in the Central Limit Theorem occurs at rate $n^{1/2-d}$ rather than $n^{1/2}$; see Beran , Taqqu ), it is wise to approximate the p-value using Monte Carlo methods. This is accomplished by simulating many white noise series, fitting a FD(d) model to each simulated series, and calculating the log-likelihood. The estimated p-value is the proportion of simulated series whose value of the test statistic exceeds the observed value. Another approximate p-value, for a test of the significance of the observed value of d, can be computed as the proportion of simulated series with an estimated d exceeding the observed estimate. The exact hypotheses being tested are $H_0: d = 0$ (white noise) vs. $H_1: d > 0$ (long memory in the form of a FD(d) model).
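This Monte Carlo calibration can be sketched generically. The `statistic` argument below stands in for whichever quantity is being calibrated (the likelihood-based statistic or the estimated d from a fracdiff fit); everything here, including the toy lag-1 autocorrelation statistic used for illustration, is our own construction:

```python
import random

def mc_pvalue(observed, statistic, n_sim, n, rng):
    """Monte Carlo p-value under the white-noise null: simulate standard
    normal white noise series of length n, recompute the test statistic on
    each, and report the proportion of simulated values that meet or exceed
    the observed value."""
    exceed = 0
    for _ in range(n_sim):
        series = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if statistic(series) >= observed:
            exceed += 1
    return exceed / n_sim

# Toy statistic used only to illustrate the calibration machinery:
def lag1_acf(x):
    """Lag-1 sample autocorrelation."""
    m = sum(x) / len(x)
    num = sum((x[t] - m) * (x[t + 1] - m) for t in range(len(x) - 1))
    den = sum((v - m) ** 2 for v in x)
    return num / den
```

Replacing `lag1_acf` with a function that fits a FD(d) model and returns the likelihood statistic (or the estimated d) reproduces the two p-values described above.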
The following two examples were taken from the literature. They were chosen because they present instances where the BACI analysis with the short-memory AR(1) variance correction and the RIA analysis utilizing a permutation test returned a significant intervention effect when in fact no intervention occurred. A possible explanation is the presence of strong correlation in the data, perhaps long memory, which could have produced the spurious detection of a significant intervention effect. The observations in the following examples are only approximately equally spaced in time; they were assumed equally spaced in order to simplify the analyses.
4.1. Sea Urchin Data
The first example involves data read from figure 4a in Bence , concerning the relative abundance of the white sea urchin (Lytechinus anamesus) in an area offshore of the San Onofre Nuclear Generating Station. The data first appeared in Schroeter et al. . The data values are differences in log-transformed density (numbers per square meter) of white sea urchins between an impact site and a control site; a plot appears in Figure 1. It is important to note that there was no intervention in the series, despite the apparent and unexplained structural break prior to mid-1981.
The analysis by Bence estimated the mean difference with a t confidence interval. He assumed an AR(1) correlation structure after the Durbin-Watson test detected significant autocorrelation. However, estimation returned a non-stationary model, ruling out the AR(1) and providing another indication that the data may possess long memory.
Fitting a long-memory model to the sea urchin data yielded the estimate . The value of the approximate chi-square test statistic equaled 34.66, with a Monte Carlo approximate p-value of 0.00161. The test for significance of
Figure 1. Approximate differences in log-transformed density (numbers per square meter) of white sea urchins between an impact site and a control site (see   ); the approximate interlake differences in chlorophyll concentrations between Big Muskellunge and Trout lakes, 1984-1986 (see Carpenter et al.  ).
the long-memory parameter yielded a Monte Carlo approximate p-value of 0.01222. The long-memory-corrected 95% confidence interval (using (3) with the estimated FD(d) ACF) for the mean difference is , indicating the mean difference is not significantly different from 0. This compares with the OLS estimate ( ), conventional 2-stage estimate ( ), and bias-corrected 2-stage estimate ( ) from Bence . Moderate long memory in the data is a possible explanation for the non-stationary fitted AR(1) model and the unexpected structural break.
4.2. Interlake Differences
Carpenter et al. ( , figure 5) report an example using the difference in chlorophyll concentrations between two lakes (Big Muskellunge Lake and Trout Lake in the Northern Highlands Lake District of Wisconsin). These data were read from the figure, and a plot appears in Figure 1. RIA was significant even though no effect was evident from the mid-1985 intervention. The plot of the data reveals a possible trend; no explanation was given for this.
The Durbin-Watson test does not detect statistically significant autocorrelation at lag 1 (p-value = 0.346), ruling out the AR(1). The fitted FD(d) model yielded the estimate . The value of the approximate chi-square test statistic equaled 81.21, with a Monte Carlo approximate p-value of 0.09075. The test for the significance of the long-memory parameter yielded a Monte Carlo approximate p-value of 0.05785. Weak to moderate long memory in the data is a possible explanation of the significance of RIA, creating an apparent trend which the permutation test detected as a spurious break due to the intervention.
5. Discussion

Murtaugh  and Stewart-Oaten  debated the effectiveness of the BACI and RIA designs. However, their points concerned incorrect specification of the process mean structure, not the process autocorrelation structure. An examination of RIA for correlated data is in Carpenter et al. ; however, the autocorrelations studied were short-memory and at most moderate in strength. Bence  recognized that short-memory variance corrections may not always be adequate for ecological data, since the actual correlation structure of the process may be more elaborate than that of a short-memory process. The simulation results in Section 3 indicate that short-memory variance corrections in the 2-sample t-test used in the BACI analysis are not adequate for data with long memory.
However, the BACI design and analysis will work better than RIA in these situations because it is amenable to a simple long-memory variance correction which will improve its performance. It is also known  that the sample mean is nearly optimal when estimating the location parameter of a Gaussian long-memory series, in the sense that the sample mean is unbiased and its efficiency when compared to the best linear unbiased estimator is very high.
Researchers have examined the relationships among long-memory, aggregation and structural breaks in time series    . Tests such as RIA which seek to detect breaks in a series whose true data generating process is a long memory process may result in spurious break detection. In the other direction, structural breaks in a time series may cause the manifestation of long-memory behavior. RIA would not be sensitive to structural breaks in a series since RIA does not detect the time at which a change occurred.
One solution is to detect the breaks in a series, correct for them, and then analyze the corrected series. However, aggregation tests  for long memory depend upon very long time series, which are rarely available in ecological experiments. Researchers should consider long-memory corrections when short-memory corrections return non-stationary models, when the data exhibit persistent autocorrelation, when an intervention effect is detected where none occurred, or when a trend appears with no apparent explanation.