Energy companies are strongly affected by uncertain price conditions, as they are exposed to the different risks from liberalized energy markets in combination with important and, to a large extent, irreversible investments. Price predictions, however, are usually expressed as point forecasts that give little guidance as to their accuracy, whereas, the planning process needs to take into account the entire probability distribution of future prices or at least intervals that have a pre-specified nominal coverage rate i.e. a given probability of containing the future prices. It is the aim of this paper to resume the prediction intervals (PIs) proposed by Williams & Goodman  , which anticipated bootstrap techniques, but was introduced at a time when its heavy computational demands slowed down practical applications.
A literature review in the field of short-term forecasting of electricity prices, reveals that limited research has addressed the issue of PIs. Misiorek et al. and Weron   were the first to consider interval forecasts. Other few exceptions are Wu et al., Nowotarski & Weron, and Bunn et al.    . However, interval forecasts remain an underdeveloped topic. See  . Under the very restrictive assumptions of independent and identically distributed Gaussian errors, PIs can be obtained by means of standard formulae. Since the distribution of forecast errors is unknown, they must be estimated to calculate interval bounds.
The main problem with assessing the reliability of price forecasts is that the magnitude of post-sample errors cannot be exactly evaluated until the prices are observed. In order to simulate such a situation, the time series under study can be split into two parts: the “training” period, which ignores a number of the most recent time points and the “validation” period, which comprises only the ignored time points. For the purpose of this study, we have not used the entire time series, but kept the very last time points untouched because they serve as a benchmark (target period) against which the quality of the PIs is to be judged. The training period is used to identify and estimate one of the large variety of electricity price models described in literature. Borovkova & Schmeck  observe that, despite the voluminous literature on modeling electricity prices, a clear “winner” model has not emerged. Here, we use a Box-Jenkins process.
Williams & Goodman  had the simple and ingenious idea of rolling ahead the training period and repeating the forecasting procedure until it is not possible to make any more multi-step predictions. The collection of multi-step-ahead errors forms the empirical distribution of the forecast errors for each lead-time. This allows us to estimate the quantiles necessary to compute the interval bounds. To implement the Williams & Goodman (WG) procedure, it is necessary to find a family of probability distributions having a member which gives a good fit to the empirical distribution of forecast errors. We will consider various density functions to be combined with SARIMAX processes for time series data concerning Italian electricity hourly zonal prices. Our aim is to find the most effective way of mixing WG procedure and SARMAX processes.
The organization of the paper is as follows. In the next section we address important aspects of data-preparation. Section 3 provides a brief review of the Box Jenkins approach to compute point forecasts. In Section 4, the Williams-Goodman procedure is discussed and in Section 5 it is combined with various density functions purportedly useful in describing the empirical forecast error distribution. In this same section is presented an application to Italian hourly zonal prices. Conclusions are drawn in Section 6.
2. Data Preparation
In this article we analyze data on hourly zonal prices traded at the day ahead Italian energy market. Because of transmission capacity constraints, Italy is partitioned into six zones: North, Centre-North, Centre-South, South, Sardinia and Sicily with a separate price for each zone. When there are no transmission congestions, arbitrage opportunities restrict the prices in each zone to be equal. See  . The hourly marginal prices can differ across zones because of transmission limits or because of dissimilar behavior of the consumers, but it is the same within a zone. The sizes of the zones are not equal. In fact, North (which includes the Italian regions of Val D’Aosta, Piemonte, Liguria, Lombardia, Trentino, Veneto, Friuli Venezia Giulia, Emilia Romagna) constitutes a large fraction of Italian national surface (40%) and its population (46%). The large islands of Sicily and Sardinia suffer from poor interconnections and frequent congestions. In the present article, we prefer to work with zonal data because they show far wilder randomness than the national price and are complex enough to challenge the statistical methods used for making predictions.
Data sets are freely accessible by the Italian independent system operator on http://www.mercatoelettrico.org/En/Tools/Accessodati.aspx?ReturnUrl=%2fEn%2fDownload%2fDatiStorici.aspx.
According to principles of decentralization and subsidiarity, creatively extended to long time series, we will treat each hour as a separate time series, such that 24 different models are estimated. All the time series go from 1 am on Monday, 7/1/2013 to 24 pm on Sunday, 26/2/2017 and hence cover a total of 24 hourly prices in 1148 days for six zones. We reconstructed the values corresponding to the changes of the daylight-saving time by the arithmetic average of the two neighboring hours, while the “doubled” values corresponding to the switch from the daylight-saving time, were replaced by the arithmetic mean of the two neighboring prices.
Time series of electricity prices display characteristics not frequently observed in other commodity markets. Pronounced daily, weekly, monthly and multi-monthly seasonal cycles; heteroskedasticity, mean reversion, and a high number of spikes (very sharp peaks or extremely deep valleys) in a short period of time. Knittel & Roberts  note that electricity prices also contain what they refer to as, an “inverse leverage effect”. Electricity price volatility tends to rise more so with positive shocks than negative shocks. Most of these characteristics stem from the fact that power cannot be economically stored and, consequently, accumulation and selling of stocks/inventories have reduced intervention potential to smooth supply or demand shocks across over time. Additionally, electricity markets face stringent distribution and transmission constraints.
To attenuate these effects, prices are log-transformed so that all upward or downward spikes are closer to the mean of the time series. The attenuation does not absolve us from trying to use more effective albeit more invasive treatment of aberrant prices. On the one hand, even if the removal of legitimate data points could be accepted as a permissible practice, the number of values suspectable of being anomalous is too large to justify their exclusion. Extreme price swings, in fact, need not be treated as enemies, because they are very significant for energy market participants. On the other hand, as noted by Fildes  , the choice of a forecasting method should not be dependent on such extremes, unless they contain information we cannot afford not to consider.
To deal with spike prices, we construct an artificial time series by decomposing the original time series into trend-cycle (expressed through orthogonal polynomials) and periodicities (expressed as a sum of harmonics with random phases). Deviations between observed and artificial prices outside the range: median plus/minus a factor of the median absolute deviation from the median, are considered anomalous residuals which may indicate abnormal price. These prices, are considered missing and replaced by a weighted average of the observed prices and the corresponding artificial prices. Although infrequent, negative or virtually zero prices do occur. These unusual prices can create problems with log transformation. So, prices less than one e/MWh are treated as missing values and imputed using the artificial time series.
3. Point Forecast
The generic time series is represented by where n is the number of observations, which, in this section, comprises the training and the validation periods (which, taken together, form the fit period). The index of the hour is suppressed, but it is understood that refers to a daily time series of one of the hours.
There is no general consensus at present on the best method to be used for electricity prices modeling. In this context, we apply Box-Jenkins forecasting method which has proven to be flexible enough to accommodate the electricity price behavior satisfactory. See   . We do not investigate other time series approaches which are potentially useful, but lie beyond the scope of the present study. For a recent comprehensive survey see  .
The general form of a sarimax model is
where is the price at the day t and is a white noise process with zero mean and finite variance . The symbol B represents the usual backward shift operator and and are polynomials in B
Some of the parameters may be zero or otherwise constrained, so that (2) could be a multiplicative seasonal model where
Expressions , , , are polynomials of order, respectively, p, P, q, Q and s indicates the length of the periodicity (seasonality). The same notation may be used to take into account multiple seasonality effects if necessary. Moreover, , . The notation indicates m variables observed at day t influencing the price of electricity; is a parameter measuring how the price is related to the j-th variable .
To keep the problem of estimating Equation (1) tractable, we use only deterministic exogenous variables so we know exactly what they will be at any future time (e.g. calendar variables, polynomials or sinusoids in time). The choice of known or non-stochastic regressors simplifies the inferential procedures, including estimation and testing of the parameters. This choice is also suggested by the fact that stochastic exogenous regressors, which must also be forecast, is one of the possible causes of inefficiency in prediction intervals. See  [Sect. 6.5]. According to the preparatory work for the present research, the most influential calendar variables are: days of the week, public holidays (official and religious) and daylight-saving time. Days immediately before and immediately after holidays are considered Saturdays and Mondays, respectively. Calendar effects are accounted for in the model by incorporating sets of dummy variables, where one of the categories is omitted to prevent complete collinearity. The dummy, or more precisely, binary variables in the process (1), precludes using the difference operators. It follows that, from now on the Box-Jenkins processes will be associated to the acronym SARMAX. The “burden of regular and seasonal non-stationarity” is placed entirely on the estimated parameters. In this sense, we require the roots of the polynomials and lie outside the unit circle, with no single root common to both polynomials. If this condition is satisfied then errors are stationary with finite variance.
The estimation of the parameters can be realized by optimizing the log-likelihood function of (1), provided that are known and errors are Gaussian random variables. Since we ignore the order of the polynomials, the estimation should be repeated for different values of . If
then there are distinct processes to be explored for each time series. We execute the search of the best process in automatic mode, over a limited set of distinct variations by using the function auto.arima() of the R package forecast with the option of a stepwise search to reduce the high computational cost of brute force search.
A common index to compare rival models is the bias-corrected version of the Akaike criterion
where denotes the estimated error variance of the candidate process. The process associated with the smallest AICC is presumed to be the best process. Let be the number of prices to be foreseen (lead-time). The selected process serves to compute, standing at time n, forecast of the price at day which are optimum in the sense of quadratic loss, conditional on an information set , i.e. . It turns out that, under reasonably weak conditions, the optimal forecast is the expected value of the series being forecast, conditional on available information. See  (p. 172).
Forecasting the regression term in (1) does not present particular difficulties because of the perfectly predictable nature of the regressors. The future values of the stochastic process term can be computed by using the infinite moving-average representation of the optimal process
where (this constraint is equivalent to the requirement of roots outside the unit circle). The coefficients in (5) are functions of the parameters in (2) and can be easily obtained by recursive equations. See  . In practice, however, the parameters of (2) have to be estimated, and it is customary to substitute estimated values into all the formulae.
4. Prediction Intervals
Short-term point forecasts cannot reflect all the uncertainties in the price of energy. In this regard, it is far more interesting to have information on how reliable the extent of the prediction is. In short, given a time series of n prices , we seek forecast limits such that the probability is that lies in
where is the price (/kWh) at a given hour of k days after day n and n is the last period at which a price is available. The point forecast is obtained by identifying and estimating a SARMAX process to the fit period (i.e. training plus validation periods). is the quantile of order of the distribution of the forecast error at origin n and lead-time k. If the hypothesis of Gaussianity is accepted for each k, then PIs can be derived from the standard formulae given by Box & Jenkins 
where is the upper point of the Gaussian distribution with zero mean and variance one. Moreover,
PIs in (7), typically called Box-Jenkins prediction intervals (BJ PIs), are the most commonly used even in the cases there are no specific reasons to assume a Gaussian distribution of the errors.  [Sect. 7.7] illustrates various possible reasons why PIs in (7) are inadequate to encompass the required proportion of future prices and the lack of Gaussianity of the forecast errors is indicated as one of the causes. See also  .
4.1. Williams Goodman Procedure
To simulate distribution of forecast errors, the time series is split into two parts: the “training” period and the “validation” period. As a preliminary, we choose a window size, (the number of consecutive daily prices) which, together with the maximum lead-time L, establish the complexity of the Williams Goodman (WG) procedure.
Initially, the training period contains prices for day 1 through whereas, the prices from to act as the first validation period. The class of SARMAX model discussed in Section 3 is fit to the training time series to find the best process which minimizes the AICC criterion (4). The selected process is then used to calculate the L-step-ahead point forecasts: at the time origin . The post-sample forecast errors are obtained from difference with the corresponding values of the validation period:
Note that, in this case, is a real price and not a random quantity.
In the successive step, a block of contiguous prices is dropped from the start of the training period and, simultaneously, contiguous prices from the start of the validation period is shifted back to the end of the training period so that the second window contains prices for day through day . The second validation period includes prices from to due to the inclusion of the next block of prices taken sequentially from the time points of the validation period not yet processed. The same class of models as in the initial step is fitted to the new training period, the new L-step-ahead forecasts calculated and the corresponding post-sample errors obtained at the time origin as .
The procedure is iterated until the last training period and the last validation period achieve the end n of the fit time series. Overall, the procedure forms distinct sequences of L-step-ahead forecast prices and post-sample forecast errors. We can arrange errors as a matrix.
Rows correspond to different time origins and columns to different lead-times. If the forecast error distributions are the same, then column can be intended as a sample of size r of the forecast errors that would have been made by the selected SARMAX process, at lead-time k across horizon origins .
The construction of PIs requires knowledge of the quantiles of the forecast error distribution, which are typically unknown and have to be estimated. An obvious way to generate PIs is to assume k-step-ahead forecast errors follow a continuous distribution function. If the fitting is satisfactory, the quantiles of the distribution can be estimated and then prediction bounds determined for each lead-time.
Chatfield  [Sect. 7.5.5], notice that, although the WG procedure is attractive in principle, it seems to have been underutilized, not only because of the lack of time series long enough to give credibility to the fit of empirical distributions, but also because of the heavy computational requirements involved. Of course, the length of the time series is not a problem with time series collected in the electricity market and analyzed in the present study. In addition, the effort required to implement WG for time series of moderate length (1000 - 1500 time points) is compatible with the hardware/software resources generally at disposal. An R script is available from the authors upon request.
4.2. Density Selection
In the framework of electricity price forecasting, it might be reasonably argued that prices are not Gaussian (see,   ) but it is not clear what is precisely their distribution. In this subsection, however, we discuss the distributional properties of post-sample forecast errors whose behavior cannot be automatically deduced with certainty from observed values and/or in-sample forecast errors. If the empirical situation does not suggest an obvious choice, one can be selected among myriad examples of probability density functions (pdfs). Williams & Goodman  adapted a gamma density to the absolute value of the forecast errors. Isengildina-Massa et al.  found that forecast errors from the data set used in their study most often followed a logistic distribution. Bordignon & Lisi  and Lee & Scholtes  propose the Gaussian distribution. We choose the mathematical function in the first row of Table 1 after a long and complicated search for a powerful and versatile curve. Obviously, we are aware that, trying many densities and keeping the “best fitting” one, does not guarantee that another model will not look better than those we have already seen.
In all the densities, controls the location of the distribution; affects the scale, and are shape parameters. The densities are referred to
Table 1. Density functions for post-sample forecast errors.
. The gamma density is fit to the absolute value of post-sample errors and hence . The system proposed by Johnson  contains three classes of distributions which are based on the transformation where z is a standard Gaussian random variable and g (with derivative ) has three possible forms
In using (10), a first problem to be solved is to determine which of the three families should be used and, once the class is selected, the next problem to be solved is to estimate the parameters. In both problems, we follow the technique proposed by Wheeler  as implemented in package in the package SuppDists with quantiles .
The column headed “R package” refers to the package used for parameter estimation. The notation “stats” indicates standard computational algorithms. The last column of Table 1 reports the technique of estimation of the parameters: mle (maximum likelihood), qme (quantile method), mme (method of moments).
The usual strategy behind fitting a given distribution to data is to identify the type of density curve and estimate the parameters that give the highest probability of producing the observed values. Instead, we follow an indirect approach: we compare the different density curves by testing how accurately the PIs generated by a SARMAX process, in tandem with Williams-Goodman method, capture the true prices.
5. Forecasting Accuracy
Let us consider the matrix of estimated forecast errors discussed in the preceding section. For each lead-time we fit the distributions shown in Table 1. Let be the α-th estimated quantile of the v-th distribution, . The quantiles are used to derive parametric prediction limits
The means and standard deviations are computed over the post-sample errors
Notice that the mean of post-sample errors is not necessarily zero.
To assess the performance of the various PIs, we compare the prediction interval actual coverage (PIAC) to . The PIAC is measured by counting the number of true hourly prices of the target period enclosed in the bounds (11)
If the PIs are accurate, then . All other things being equal, narrow PIs are desirable as they reduce the uncertainty associated with forecast-based decision making. However, there is a trade-off between PI widths and PIAC: the wider the PI, the higher the corresponding PIAC and hence the greater is the accuracy of predictions, at least to a certain extent, because very wide PIs are not practically useful. On the other hand, very sharp PIs with a low coverage probability are useless as well. It is necessary to introduce, in this connection, a scoring rule for addressing the sharpness of PIs. We use a score function of the form proposed by Winkler & Murphy 
The use of ratios facilitates comparability across price levels. The symbol represents an indicator function taking one if the argument is true and zero otherwise. The first addend in (14) reflects a cost associated with the width of the interval. The cost decreases as increases, to compensate the tendency of the bounds to be broader as the confidence level increases. The other two addends penalize PIs if the target is outside the interval. The penalty increases with increasing distance from the nearest interval endpoints. The average of (14) across time points provides an indication of the sharpness of PIs
Criteria (13) and (15) should be judged keeping in mind the stochastic behavior of the electricity prices. Here, we have a potentially severe problem. Price peaks and valleys have been smoothed for the training and validation periods, but the same has not been done for the target period. These prices, in fact, are left as they are observed, to simulate real conditions. Spike prices, however, are recurring events and, therefore, it would not be surprising to find some of them in the target time series. Our SARMAX processes, being developed within a cleaned-up data set, have hardly any possibility to predict satisfactorily all, or at least a good part, of the outliers. Remaining outliers imply poor prediction intervals in practice. Further research is required to formulate a model which is not only generally enough to merge Box-Jenkins processes, WG prediction intervals and spike prices, but also it is numerically tractable to provide a quantitative description of the complex patterns of electricity market time series.
To this end, we analyze different time series, one for each hour of the day and each zone of the Italian electricity market. All the daily time series are long 1148 days, but the last three weeks ( ) are reserved for assessing the predictive accuracy of the intervals. Thus, only the first 1127 days are used for estimation and validation of SARMAX models. The size of the rolling window is fixed at (15%) which leads to samples of 21-step ahead forecasts. The search of the SARMAX processes is conducted within the bounds , which include 81 different processes. Each process is combined with the WG procedure applied to any one of the density functions in Table 1. For completeness, we include the BJ PIs described in (7) and Tchebycheff PIs
At this point it is necessary to remind that the point estimates are obtained for log-prices , which have to be transformed back to the original scale to give forecasts for . A simply way is to transform the bounds obtained for by applying a simple fractional multiplier
where is the correction factor proposed by [Baskerville, 1972] and is the standard deviation introduced in (8). Expression (17) is a genuine PI since is a monotone function of , but they do not have necessarily the same structure. For example, the interval (17) is asymmetrical even though the interval in the log scale is symmetrical. Furthermore, the anti-logarithms of forecasts are biased. See  .
Table 2 shows the results of PIAC and MS at . The row
Table 2. Quality of PiIs.
entitled “Frct” denotes the percentage of times, out of the 144 cases studied, the corresponding density determined PIs with the lowest magnitude among all the PIs associated with an actual coverage rate greater than or equal to . The symbol “-” indicates that the corresponding distribution has never taken the first place in the two rankings of forecast accuracy used in this study.
On a first general examination, we note the consistent behavior between actual coverage rate (PIAC) and mean relative scores (MS), with the latter decreasing as the former increases. Naturally, this is a confirmation of the expected behavior of the score function (14). Tchebycheff intervals show, perhaps not surprisingly, the largest widths. Box-Jenkins prediction intervals (BJ PIs) appear to be the most conservative approach, i.e. it yields largest coverage rates, but with prevalently smaller widths than the Tchebycheff PIs. The substantial reliability in performance of the WG procedure based on Johnson’s system, gamma, logistic and Gaussian distributions is due to actual coverage rates which are decidedly lower than the corresponding nominal coverage rates than BJ and Tchebycheff PIs. But, above all, the formers have a much smaller sharpness than the latters at all confidence levels. The distributions nested within Johnson’s system come more frequently on top in terms of actual coverage probability and sharpness of the intervals and hence can be considered the optimal probability density within the experimental set up of our study.
Prediction intervals (PIs) are random sets designed to contain a future value with a given probability. The principal reason for constructing them is to provide an indication of the reliability of point forecasts avoiding a complete description of the probability distribution of the uncertainty associated with a prediction. Box-Jenkins or BJ PIs (the procedure in common use currently) assume Gaussian errors, known parameters and intervals are centered about the conditional expectation. Consequently, BJ PIs cannot take into account the variability due to parameter estimation and behave poorly when the errors are not Gaussian. Our findings confirm these observations.
The primary concern in this paper is with the Williams & Goodman  (WG) procedure and the gain of accuracy brought by WG PIs in comparison to conventional BJ PIs. Our findings show point forecasts for day-ahead hourly prices in Italian wholesale electricity market may be of greater utility if accompanied by PIs calculated using WG centered around a member of the Johnson system of density functions. This mix offers PIs having a coverage rate greater than the nominal rate, but, what is more important is that this desirable conservativeness is not at the expense of widening the intervals, which instead are admirably sharp. The procedure has proven to be very competitive on two other procedures: Box-Jenkins and Tchebycheff PIs.
The authors would like to thank a referee for helpful comments, which lead to an improvement of the paper.