This work applies non-stationary random processes to resilience of power distribution under severe weather. Power distribution, the edge of the energy infrastructure, is susceptible to external hazards from severe weather. Large-scale power failures often occur, resulting in millions of people without electricity for days. However, the problem of large-scale power failure, recovery and resilience has not been formulated rigorously nor studied systematically. This work studies the resilience of power distribution from three aspects. First, we derive non-stationary random processes to model large-scale failures and recoveries. Transient Little’s Law then provides a simple approximation of the entire life cycle of failure and recovery through a queue at the network-level. Second, we define time-varying resilience based on the non-stationary model. The resilience metric characterizes the ability of power distribution to remain operational and recover rapidly upon failures. Third, we apply the non-stationary model and the resilience metric to large-scale power failures caused by Hurricane Ike. We use the real data from the electric grid to learn time-varying model parameters and the resilience metric. Our results show non-stationary evolution of failure rates and recovery times, and how the network resilience deviates from that of normal operation during the hurricane.
Received 11 May 2015; accepted 26 February 2016; published 29 February 2016
The power grid is a vast interconnected network that delivers electricity to customers. Power distribution system lies at the edge of the power grid  . Power distribution provides medium and low voltages to residences and organizations. Distribution networks consist of a large numbers and diverse types of components, such as substations, power lines, poles, feeders, and transformers. Many such components are distributed in the open and exposed to external hazards such as hurricanes, derechoes, and ice storms  . About 90% of total failures occurred at power distribution  . Thus power distribution is particularly susceptible to external disruptions from severe weather  .
A fundamental research issue pertaining to this real problem is the resilience of power distribution to large- scale external disruptions. Here, resilience corresponds to the ability of power distribution to reduce failures and recover rapidly when failed  . Until now, tremendous efforts have been directed to resilience of the core that consists of the major power generation and transmission of high voltages  -  . For example, cascading failures at transmission networks have been studied widely   -  . However, the majority of the failures occurred during severe storms are often at power distribution rather than transmission networks  . As the demand for energy grows, the edge of the grid becomes more and more important. For example, a utility (dis- tribution system operator) can serve millions of customers in America. Damages from such hazards as severe storms to power distribution can profoundly impact a large number of users. Hence, the resilience of power distribution requires significant study. Resilience of power distribution under severe weather poses unique challenges:
・ Randomness and dynamics (i.e., non-stationarity) of failures and recoveries,
・ Time-varying resilience at the network level,
・ Estimation of non-stationarity and the resilience using real data from the electric grid.
A pertinent first step is to model large-scale failures and recoveries. Such a model is a prerequisite for deriving a resilience metric at the network level. The metric needs to reflect the intrinsic characteristics of large- scale failures and recoveries. Severe weather disruptions such as hurricanes evolve randomly and dynamically. So do large-scale failure and recovery at power distribution. For example, failures occur and recover depending on random factors such as the intensity of a storm and dynamic allocations of repair crews. These factors vary with time. Hence, it is appropriate to model based on non-stationary random processes.
Prior approaches account for randomness of failures but rarely dynamics    -  . Models for failures are widely studied in computer-communication, e.g., finite state Markov process  and reliability of other multi-component systems   . These models are stochastic but assume stationary properties or distributions. Stochastic models and random processes are also studied for other aspects of power distribution systems (e.g., forecasting and modeling renewable resources and power generations  -  , stability of power flows  , demand response  , maintenance process  , and degradation process  ). We develop an approach to learn from data for non-stationary stochastic processes on weather-induced failures and recoveries  .
Another challenge is how to quantify the resilience of power distribution. Resilience in this work measures the performance of power distribution during severe weather. In principle, such a resilience metric should manifest the difference between the performance in severe weather and normal operations  . Various reliability metrics are developed and widely used, including the IEEE standard indices for power systems  . The reliability metrics are for daily operations where disruptive events (e.g., severe weather) are excluded  . It is thus infeasible for the reliability measures to be used to study resilience that focuses on power failures and recoveries induced by severe weather events. During a severe weather event such as a hurricane, large-scale failures and recoveries can occur, which is significantly different from that in daily operations. Resilience metrics are thus open and much needed for both industry and communities as advocated by the recent reports in the nation   . Several resilience measures are developed by the prior works, including a static metric of fragility  and dynamic metrics of functionality or quality  . How resilience evolves with time is modeled by brute force   or averaging over time   . In principle, resilience as a performance measure at the system-level is lacking. Such a resilience metric needs to be derived from a system model of failures and recoveries.
A third challenge is that unknown parameters of non-stationary models and a resilience metric need to be estimated from real data  . Prior works studied the power failures using historical data from previous storms   -  . As the models were static, the learned parameters provided the static characteristics of power failures. Dynamic characteristics of power failures have been studied little from real data. Recovery is rarely studied using real data. A challenge is to learn from one external disruption such as a hurricane, as real data is rare and often unavailable from many large-scale weather disruptions.
The contribution of this work is to address the above three challenges, which are:
・ To develop a model based on non-stationary random processes,
・ To derive a dynamic resilience metric based on the model,
・ To learn time-varying model parameters and resilience metric using large-scale real data.
We first formulate, from bottom up, an entire life cycle of large-scale failure and recovery. The problem formulation begins at the finest level of network nodes based on temporal-spatial stochastic processes. Since each external disruption results in one snapshot of nodal states (failed and normal), information from one weather event is insufficient for completely specifying a temporal-spatial model  . Thus we focus on temporal models by aggregating spatial variables. Such an aggregation enables this work to focus on the non-stationary nature of failure and recovery at a moderate time scale, e.g., minutes and beyond. Such a time scale concurs with that of an evolution of a severe storm  . Failure and self-recovery that occur in seconds or shorter within power distribution systems are studied in other contexts  .
A resulting temporal model can be approximated by a queue  . The arrival process to the queue characterizes failures with a general time-varying distribution. The service time of the queue corresponds to the delay for failures to recover, and has a general time-varying distribution. means that it is possible for failures to recover immediately. Note that such a queue is easily extensible to multiple queues at different geo-locations  without loss of generality. Such a queuing model is an approximation of recovery in practice as first-come-first-service policy is assumed. When recovery is conducted with certain optimality, intuitively such a model services provides a “lower bound” for the performance of restoration.
We study an analytically tractable case of queue through the Transient Little’s Law  , which characterizes an entire non-stationary life-cycle of large-scale failures. The importance of Transient Little’s Law is that two simple quantities, failure rate and probability distribution of failure duration, completely quantify the dynamic model to the first moments. This simplifies definition of dynamic resilience and estimation of models parameters from data. We define a dynamic resilience metric that includes not only resistance to failures but also fast recovery as one additional attribute. Such a dynamic metric shows the time-evolution of network resilience during an external disruption. Finally, the non-stationary model and the resilience metric are applied to a real life example of large scale failures of power distribution. The failures occurred during Hurricane Ike in 2008. Real data from power distribution is used to study failure and recovery processes as well as resilience.
The rest of the paper is organized as follows. Section 2 provides background knowledge and an example of large-scale failures at power distribution. Section 3 develops a problem formulation from nodes (components) to queue at the network level. Transient Little’s Law is applied to obtain the non-stationary failure and recovery rates. Section 4 defines resilience based on the non-stationary model. Section 5 estimates pertinent model parameters and the resilience metric using large-scale real data. Section 6 discusses our findings and concludes the paper.
2. Background and Example
To provide intuition for modeling on failures and recoveries induced by severe weather, we begin with two examples.
2.1. Synthetic Example
The first example illustrates how failures can be induced by severe weather. The example is on a small section of a power distribution system drawn from  and shown in Figure 1(a). The section consists of two sources and, seven components and five loads. A component can be a transformer, a feeder, a pole, or a circuit. Links correspond to power lines. Assume either a source or a component or a link can fail during a hurricane. Assume primary source is used in normal operations; and back-up source is used if fails. The following scenarios can occur:
a) If any of the components fails independently due to an external disruption, and if both the primary and the back-up sources can be in operation, there is no electricity supply to the loads that are connected to the failed components. Thus, the components and the loads experience independent failures. Such independent failures can be caused directly by external factors at a time scale of a minute or beyond.
Figure 1. (a) A section in a distribution system. (b) Empirical distribution of failure time and duration. Failures and their durations are plotted at the time scale of hours for ease of illustration. () are the intervals used for estimating an non-stationary distribution of failure durations (see Section 5).
b) If and both fail due to an external disruption, there is no electricity supply to any loads. Hence, the components and loads experience dependent failures that can occur at the same time. The scenario of dependent failures also applies to failed substations and other components upstream in a radial topology of power distribution that cause loss of electricity at other nodes. Such dependent failures often occur within a time scale of seconds. Note that such dependent failures do not exhibit cascading effects that occurred in transmission networks due to a radial topology of power distribution.
c) Recovery depends on the types of failures, restoration schemes, as well as the terrene conditions. For example, if either source or component 1 or component 2 fails, the electricity supply to all loads can be recovered almost immediately when is in operation. Generalizing this scenario, self-recovery and automated reconfiguration built in power distribution usually operate within seconds  . However, if a power line to a load fails because of a fallen tree, the recovery may require dispatching crews to the field, and thus be prolonged depending on environmental constraints, preparedness, and resources.
In summary, failures and self-recoveries in a small time-scale of seconds depend on detailed topology and self-recovery schemes. Failure and recovery at a larger time scale of minutes and beyond are often due to external disruptions that evolve dynamically and randomly.
2.2. Real Data
The second example illustrates non-stationary failures and recoveries. The example uses real data on an operational power distribution system during Hurricane Ike. Hurricane Ike occurred in 2008 and affected more than 2 million customers at densely populated areas in Texas and Louisiana. Figure 1(b) shows a histogram on failure occurrence time and duration at the distribution network before, during and after the hurricane (see Section 5 for details on the data). The histogram demonstrates the non-stationarity of the power failures and recoveries during the hurricane:
a) Failure occurrence was time-varying and random. More failures occurred during the hurricane than those that occurred before and after.
b) Recovery time was also time-varying and random. Recovery time was different for failures occurred at different time. For example, more failures occurring during the hurricane recovered slowly than those that occurred before and after.
As the result, the probability distributions of failure-occurrence and failure duration vary with time in minutes and hours. Note that information on root causes of failures and recoveries is unavailable, which is beyond the scope of this work.
3. Stochastic Model
We now formulate time evolution of large-scale failure and recovery as a non-stationary random process. We begin with detailed information on nodal states (failure and normal). We then aggregate the spatial variables of nodes to obtain the temporal evolution of failure and recovery of an entire network.
3.1. Failure and Recovery Probability
A spatial-temporal random process provides theoretical basis for modeling large-scale failures at the finest scale of nodes (component). The shorthand notation i is used to specify both the index of a node and its corresponding geo-location, where for a power distribution network with n nodes. The temporal variable is continuous time t.
Let be the state of the i-th node at, where. Two states are considered for a node. if the i-th node is in failure, e.g., there is no electricity supply at location i and time t. if the node is in normal operations. Here, two-state rather than multi-state failure models are considered for simplicity. A network state is then characterized by. For analytical tractability, failures and recoveries are assumed to have been detected already  . Hence state estimation from power flows  is out of the scope of this work.
Failures caused by external disruptions exhibit randomness. Whether and when a node fails is random. Whether and when a failed node recovers is also random. Given time, characterizes the probability that node i is in the failure state at, where is a small time duration. Assume a node changes state, i.e., from failure to normal and vice versa. Then for the i-th node, ,
Equation (1) models an individual node in a network. The model includes Markov temporal dependence for nodal states which is a simple assumption for state transitions. Such a model can be applied to a heterogeneous grid where nodes experience different failure and recovery processes in general. There are no assumptions on an underlying network topology nor independence/dependence of failures. Such n equations for n nodes together form a spatial-temporal model for a network.
Each severe weather event generates one snapshot of network states. Information available on failures and recoveries is often from one or a few events. Such information is insufficient for specifying the spatial-temporal model. Hence, we derive a temporal model in this work by considering an entire network as a whole.
3.2. Temporal Process
Our temporal model aggregates spatial variables from Equation (1),
The probability can be further related to an indicator function, e.g.,. We now define a temporal process as follows.
Definition 1. Let be the number of nodes in the failure state at time t. is a temporal random process, where
is an indicator function, i.e., if event A occurs, and, otherwise.
Let represent an increment of the number of failures in interval. Combining Equations 2 and 3, the expected increment of failures at time t is
An increment for the total number of nodes in a failure state can result from either newly failed or newly recovered nodes. To further characterize the temporal process, we define a failure process and a recovery process respectively.
Definition 2. Failure process is the number of failures that occur up to time t.
Now we assume that failure is a counting process with independent increments (given weather conditions): For any, increments ‘s are independent. Such independence assumption results naturally from randomly occurring failures given weather conditions. Here the assumption holds at the time scale of minutes or hours at which severe weather (e.g., high-speed wind) evolves. Dependent failures can occur at smaller time scale of seconds, which is due to a structure of power distribution  . This is beyond the scope of here and will be addressed in a subsequent work. Hence the time-scale of failure processes is assumed to be in minutes here.
Definition 3. Recovery process is the number of recoveries that occur up to time t.
Assume is sufficiently small so that at most one failure occurs during. Then,
Similarly, assume that, at most one recovery occurs during for a sufficiently small. Let be an increment of the number of recoveries. Then,
Here recovery time (or failure duration) is also assumed at the time scale of minutes. Equation (2) can be rewritten as. Furthermore, assume at, , , and. Aggregating increments from 0 to t, we have. This is intuitive that the number of failures equals to the difference between the total failures and the recoveries occurred so far.
Hence, the expected number of nodes in the failure state equals the difference between the expected failures and the expected recoveries. The time-scale of a minute enables this work to focus on modeling failures that are induced by external disruptions and the recoveries that can not be accomplished by instant self-healing schemes. The aggregation conceals spatial variables  , network topology  and automated reconfiguration that are not discussed in this work.
3.3. Dynamic Queuing Model and Transient Little’s Law
Failure and recovery can be further related through a time-varying queue   as shown in Figure 2.
1) The arrival process to the queue corresponds to the number of failures with independent increments. The increments have a general time-varying distribution.
2) A failure that occurs in experiences delay before recovery. The delay corresponds to failure duration and has a general time-varying probability density function.
3) The departure process of the queue corresponds to the number of recoveries. implies an infinite number of servers for repair. This means that it is possible for recovery to occur right after failure.
A queue is applied here for the first time to characterize the temporal characteristics of a non-stationary joint failure-recovery process. Analytical expressions of departure processes are often intractable for general arrival processes. But the expected number of failures and recoveries, i.e., the first moments of the arrival and departure processes, can be quantified in a simple form through Transient Little’s Law  . In fact,
Figure 2. Application of queue and transient little’s law.
Transient Little’s Law provides an analytically tractable case of queue  .
Theorem 1. Transient Little’s Law 
Consider as an independent counting process. Let be the delay for an arrival in, where. Assume that an arrival of an increment does not affect the delay of previously arrived increments. Then the expected number of arrivals in the queue at time t is
where is the arrival rate,
Consider an increment of arrivals as new failures, an arrival rate as a failure rate, a delay as a failure duration, and departures as recoveries. Assume that recoveries occur following first-in-first-out (FIFO) policy. Transient Little’s Law can then be directly applied to our problem. The theorem has an intuitive explanation: is the average number of arrivals (failures) in, where is sufficiently small. is the probability that an arrival (failure) at does not recover until time t. Hence, is the total average number of failures that do not recover until time t. The mathematical proof of the theorem can be found in  .
Define recovery rate as the expected number of new recoveries per unit time at time t,
Applying Transient Little’s Law, the recovery rate can be related to a failure rate and a recovery time distribution by the corollary below.
Corollary 1. Let be an independent (failure) increment process with a rate function. Let be a failure duration with a conditional probability density function, where,. Then recovery rate satisfies
The proof of the corollary is in Appendix 1. In summary, two pertinent quantities completely determine the expected number of failures and recoveries: Failure rate and probability density function of failure duration given failure occurrence time.
We now derive a resilience metric using the pertinent parameters for an entire life cycle of non-stationary failure and recovery. While resilience can be characterized from multiple dimensions  , the infrastructure of power distribution is where failures occur. Hence we quantify the so-called (system) resilience by characterizing failures and recoveries of all nodes in a distribution network.
4.1. Infant and Aging Recovery
For an non-stationary recovery process, a failure duration depends on when the failure occurs (Figure 1). Given threshold, the conditional probability that duration is bounded by for failures that occur at time t is
When is sufficiently small, this probability characterizes rapid recovery referred to as infant recovery. This terminology is borrowed from infant mortality in survivability analysis  . Infant recovery is a desirable characteristic of power distribution. In contrast, slow recovery is referred to as aging recovery that is analogous to aging mortality  . Obviously, aging recovery is undesirable.
Definition 4. Infant and aging recovery
Let be a threshold value. If a node remains in failure for a duration less than, a recovery is an infant recovery. Otherwise, the recovery is an aging recovery. Infant recovery is characterized by. Aging recovery is characterized by.
Note here is a function of failure occurrence time. Hence, in general, a recovery process is non-stationary.
4.2. Dynamic Resilience Metric
As failure and recovery processes are dynamic, a resilience metric should be dynamic also. Furthermore, how resilience varies with time should result from the dynamic model of failure-recovery processes. Following such a principle, we define resilience from bottom-up, starting with one node. Probability that node i is in normal operations characterizes the ability to resist to failures at time t. Probability of infant recovery characterizes the ability of the node to quickly recover. Combining these two abilities, we define resilience as follows.
Definition 5. Resilience of a node
Given threshold value on failure durations, resilience for node i is the probability that the node is either functioning or exhibiting infant recovery, where
Aggregating the resilience of nodes over an entire network, (system) resilience is the expected percentage of nodes that are either functioning or recovering within upon failures.
Definition 6. Resilience of a network
Given threshold value on failure durations, resilience of a network is
Hence, aggregating over spatial variables, network topology and automated reconfiguration, the resilience of a network is an average resilience of all network nodes:
exhibits the following properties:
1) Resilience is a property of a distribution network as a whole to survive large-scale external disruptions.
2) Resilience is a function of time that reflects temporal evolutions of failures and recoveries in a network.
3) Resilience shows the ability of a distribution network to resist failures and recover rapidly.
4) Resilience depends on threshold on failure durations. The failures that can recover within are considered tolerable in terms of the resilience metric. When, the resilience metric offers no tolerance to delays in recovery, and only characterize the ability of resisting to failures.
4.3. Resilience Parameters
The resilience metric can be characterized by the parameters of the model, i.e., non-stationary random processes in Section 1. In particular, the resilience metric (Equation (13)) can be represented through a simple expressions owing to Transient Little’s Law,
The second term corresponds to the aging recoveries at time t. Let and is the conditional cumulative distribution function of duration. The resilience can also be viewed as one minus the expected percentage of nodes in aging recovery.
The above expression shows that given threshold, two parameters and together determine the system resilience. A smaller failure rate results in more functioning network nodes and thus a larger resilience. A higher percentage of infant recovery results in a fewer aging recoveries and thus a larger resilience. Our model is determined by these two time-varying parameters to the first moments (see Section 1). So is the resilience metric.
4.4. Resilience of Non-Homogeneous Poisson Processes
A special case of resilience is when the failure process is a Non-Homogeneous Poisson Process (NHPP). As a commonly-used failure process  , a non-homogeneous Poisson Process has independent increments and can be completely determined by time-varying rate function,
When a failure process is an NHPP, a queue reduces to a queue. represents a Poisson failure process with a time-varying rate. Furthermore, a recovery process is also an NHPP  . When, the recovery process becomes stationary. model reduces to queue developed in  .
Threshold is a pertinent parameter that measures the degree of resilience in terms of infant recovery. can be determined by service requirements. For example, hours is used by Distribution System Operators (power utilities) when it is acceptable to restore a failure within 24 hours after severe weather strikes.
can characterize a special value of as follows. Consider a scenario where failures occur suddenly and intensely due to severe storm, e.g., from a hurricane. That is, characterizes impulse-like failures that increases sharply from a small value, , at time 0 (normal operation) to a large value, , in short duration. The failure rate can then be written as, where is the unit step function and. Furthermore, consider a special case when infant recovery dominates, i.e., for. This implies that all failures recover within duration. The expected number of nodes in failures at time t is
where. The terms include the remnants when is approx- imated using the first-moment. This expression shows that infant recoveries occur within after failures erupt. Here is assumed to be larger than the duration of failure process for simplicity. In contrast, when aging recovery dominates, for. The expected number of nodes in failure at time t is
where. This expression shows that aging recoveries do not start until delaying from the eruption of failures.
In general, a failure-recovery process can be regarded as a combination of these two special cases. At time, when a severe storm starts to impact a power grid, the failure process dominates, and increases rapidly. The recovery process starts after occurrences of failures, and gradually dominates. When parts of the failures recover within time duration (i.e., as infant recoveries), there can be a sharp decrease in. The remaining failures are restored with longer delays as aging recoveries. Therefore after the sharp decrease, may decrease at a slower rate. Following these scenarios, we expect to see a sharp decrease after a sharp increase in the temporal curve of. The time delay between the two sharp changes can be chosen as, where
5. Numerate Results of Real Data
We apply the non-stationary failure-recovery processes to a real-life example of large-scale failures caused by a hurricane. Our focus is on estimating the three pertinent quantities, and threshold using real data. These parameters are then used to measure dynamic resilience of an operational power distribution network.
5.1. Real Data and Processing
Hurricane Ike was one of the strongest hurricanes that occurred in 2008. Ike caused large scale power failures, resulted in more than two million customers without electricity, and was considered by many as the second costliest Atlantic hurricane of all time   .
Reported by the National Hurricane Center  , the storm started to cause power outages across the onshore areas in Louisiana and Texas on September 12, 2008 prior to the landfall. Ike then made the landfall at Galveston, Texas at 2:10 a.m. Central Daylight Time (CDT), September 13, 2008, causing strong winds, flooding, and heavy rains across Texas. The hurricane weakened to a tropical storm at 1:00 p.m. September 13 and passed Texas by 2:00 a.m. September 14.
Widespread power failures were reported across Louisiana and Texas starting September 12  . A major utility collected data on power failures from more than ten counties. The outages included various component failures in the distribution network such as failed circuits, fallen trees/poles, and non-operational substations. The raw data set consists of 5152 samples. Each sample consists of the failure occurrence time () and duration () of a component (i) in a distribution network. The accuracy for time t is one minute. In the data set, 2005 samples were from 7 a.m. September 12 to 4 a.m. September 14, during which Hurricane Ike made the landfall in Texas. These failures are considered to be resulting from the hurricane, and form our data set.
Among the 2005 samples, there are groups of failures that occurred within a minute. Failures within a group are considered as dependent and aggregated as one failed entity. Each group has a unique failure occurrence time and duration. There were also groups of failures that recovered within a minute that are combined as one entity of recovery also. After preprocessing the groups as one entity, the resulting data set had 465 failed entities at the time scale of a minute. Two outliers with negative failure duration were further removed. The remaining 463 failed entities from 7 am September 12 to 4 am September 14 form data set.
The 463 samples are then randomly partitioned into a training set of 333 samples and a test set of 130 samples. The training set is used to learn parameters. The test set is used for validating the model and the parameters.
5.2. Empirical Failure Process
We now use the data set to study the empirical processes, , and which are sample means of the expectations, , and, respectively.
5.2.1. Estimating Failure Rate
First, we use the training set to determine failure rate. We estimate the empirical rate function through a
simple moving average  :, where hours. We use the training set to
estimate the failure rate and use the testing set to validate the estimation. Figure 3(a) shows the estimated failure rate and the estimation error with the 95% confidence interval. The failure rate increased and decreased in accordance with the evolution of the hurricanes. Before 7 p.m. September 12, when the hurricane was yet to arrive, the rate was less than 5 new failures per hour. Then the failure rate increased rapidly and reached the maximum value of 50 new failures per hour. The peak time of the failure rate coincided with the time of the landfall at 2:10 CDT September 13  . After that, the failure rate reduced to a small value of less than 5 new failures per hour. As the failure rate was time varying, the failure process was non-stationary.
5.2.2. Non-Homogeneous Poisson Model
We now consider hypothesis that these 463 failure occurrences are governed by a non-homogeneous
Figure 3. Estimated failure rate of the distribution network during hurricane ike. The upper and lower bounds shows the estimation error at the 95% confidence interval. (b) Quantile-Quantile plot on occurrence times of power failures, versus non-homogeneous poisson process. The theoretical quantiles corresponds to the NHPP with rate function.
Poisson process. If is true, the assumption of our model is validated that the failure occurrences have an independent increments.
We perform Pearson’s test  on hypothesis that the empirical failure process follows a non- homogeneous Poisson Process with rate function. The test focuses on two aspects: (a) the independence of the failures occurred at the large time scale of tens of minutes, and (b) the empirical rate function is sufficient for characterizing the failure process. Detailed procedure of the test is in Appendix 2. The time duration is divided into 400 intervals between 7 a.m. September 12 and 4 p.m. September 14. In each interval, the number of new failures is compared with its expectation computed from. The sum of the square errors from all of the intervals results in a chi-square statistic. The chi-square statistic is, with a degree of freedom of 2. Given a confidence level at 95%, the threshold value is obtained as, where. The chi-square statistic obtained from the data is below the threshold. Hence hypothesis is not rejected.
However, not rejecting is insufficient for accepting the hypothesis. The goodness of fit of NHPP to the data is further validated through the Quantile-Quantile (QQ) plot as shown in Figure 3(b), where the samples are compared with an NHPP with rate function. The figure shows a good fit of the NHPP to the data. Hence, the 463 power failures occurred independently, and follow a non-homogeneous Poisson process.
5.3. Empirical Recovery Process
Next we study empirical recovery-time distribution given the failure occurrence time. Our objective is to identify infant and aging recovery.
The 463 samples in our data set consist of durations of the failures that occurred from 7 a.m. September 12 to 4 p.m. September 14. Figure 1(b) shows the joint empirical distribution of these samples. Each bin is two hours on failure occurrence time and 12 hours on failure durations. The height of each bin located at failure occurrence time t and duration d represents the number of failures that occurred at t and lasted for d hours.
5.3.2. Mixture Model
As the failure durations varied with the hurricane (Figure 1(b)), we choose a mixture model for the probability density function for duration give failure time t,
where is the number of mixtures at time t, () is a weighting factor for the jth mixture function, and. Both the mixture functions and the weight factors vary with failure time t for a non-stationary recovery process.
We select a Weibull distribution as a mixture function because the parameters exhibit clear physical meaning  . Mathmatically the Weibull mixtures can be expressed as
where, and are the shape and scale parameters respectively. When the shape parameter is less than 1 and/or the scale parameter are small, signifies short failure durations and thus the infant recovery. Weighting factor characterizes the importance of a component. For a non-stationary recovery process, these parameters vary with failure time t.
5.3.3 Parameter Estimation
The parameters of the Weibull mixtures are estimated from the training samples. For simplicity, we divide the failure time into 5 intervals shown in Figure 1. Within an interval (),is assumed to be a time homogeneous function that does not vary with failure time t. Across different intervals, ‘s have different parameters for non-stationarity. The parameters of the mixtures of the Weibull distribution are estimated through maximum likelihood estimation  , and the estimated mixture distributions are shown in Figure 4. Each mixture represents one cluster of failure durations, and mixture parameters are different for different intervals. For example, in interval (the time when the network was not yet impacted by Hurricane Ike), most failure durations were as short as a few hours. In interval, and (the reigning time of the hurricane), there were more prolonged failures, when restoration became more difficult than that for daily operations. The failure duration exhibits a distinct distribution for different intervals, confirming non-stationarity of the recovery process.
We now study time evolution of resilience. First, we obtain an optimal threshold for this power distribution network. The “optimal” threshold here refers to a best partition between the infant and aging recoveries. Such a partition is obtained empirically from data. Figure 5(a) shows the comparison between and. As we expected, first increased to its maximum value after the failures occurred, and then dropped to its minimum value. The duration between the maximum and the minimum is 15.50 hours. Thus hours is identified as the threshold. The threshold is depicted in Figure 4. The mixture component that on the left of the threshold line corresponds to infant recovery. In total, 50.75% of the failures are categorized as infant recoveries.
The network resilience is then obtained through Equation (15) using the failure rate and distribution of failure duration. Figure 5 shows the time evolution of network resilience. The dynamic evolution of resilience provides the following observations.
Figure 4. Non-stationary (empirical) distribution of failure durations with respect to the failures occurred in the five intervals () depicted in Figure 1(b). The width of each bar is 8 hours. The colors represent different mixtures of Weibull distribution.
Figure 5. (a) Threshold. (b) Dynamic evolution of resilience of the distribution network.
・ Prior to the hurricane, no failures occurred yet, and the resilience was close to 1.
・ A large number of failures then occurred and reduced the resilience to a lower level. How fast the resilience decreased was measured by. The decreasing speed reached the maximum after 16.45 hours since the first failure appeared. In the meantime, the failure rate also reached the maximum.
・ At 3 am September 14th, about 42.7 hours after the first observed failure (24.8 hours after the landfall, and 26.25 hours after the failure rate reached the maximum value) the resilience reached the minimum value. There, 46% (214 out of 463) of total failures were in aging recovery. The maximum reduction of resilience from that of
the normal operation was, where n was the total number of nodes in the network. At this time, the network was experiencing the most impact from the hurricane.
・ After the minimum, the resilience increased when more failures were restored. The impact from the hurricane was fading gradually. It took about 10.7 days for the resilience to return to that of the normal operation from the minimum value.
The dynamic resilience metric resulting from the non-stationary model provides following insights and understanding. First, the static resilience developed in the previous works is overly-pessimistic for quantifying the resilience before and after the landfall of the hurricane, where either few failures occurred or most failures recovered. The static metrics are overly-optimistic around the landfall, where a large number of failures experienced aging recovery. Second, the dynamic resilience metric quantifies joint effects of failure and recovery processes, showing not only the failure rate but also the speed of recovery. Third, the dynamic metric reveals the worst-case resilience of the network during a hurricane. The dynamic resilience also identifies the time when the resilience reached the worst value, showing an important period during a life-cycle of failures and recoveries. For example, the network was the weakest in the duration (26.25 hours) between the failure rate reached its peak value and the resilience attained the minimum, when most failures already occurred but the restoration slowed down to aging recovery. When the network survived the weakest period, the resilience began to improve due to recovery and few additional failures.
We have derived a non-stationary random process to model large-scale failure and recovery of a power distribution network under external disruptions. The resulting model is a dynamic queue that has independent failure increments and time-varying distributions for failure durations. Transient Little’s Law provides a simple characterization of the non-stationary failure and recovery processes through two quantities: a time-varying failure rate and a probability distribution of failure durations. A new metric on resilience is then defined using these two quantities together with a threshold on failure durations. The resilience metric is dynamic, showing both the ability of network to remain operational and recover rapidly from failures during severe weather. A threshold on rapid and slow recovery has been identified as a time duration between the maximum and minimum difference of the failure rate and the recovery rate. A minimum value of the resilience is then obtained to show how much the resilience deviates from that of the normal operation.
We had used real data from an operational network that was impacted by Hurricane Ike. The failure rate and non-stationary probability distribution of failure durations as well as resilience metric are estimated from the real data. The failure process has been shown to be an non-homogenous Poisson process at the time scale of minutes. The recovery-time distribution has been modeled as Weibull mixtures with time-varying parameters. A threshold value is obtained as 15.5 hours for this network, where 50.8% of the failures recovered rapidly. The network resilience reached its minimum value 24.8 hours after the landfall when the aging recoveries were 46% of all failures. The network experienced the most difficult time when the failure rate reached the peak value and the aging recovery dominated until the resilience decreased to the minimum. It then took about 10 days for the network to regain 100% resilience from the minimum value. These observations suggest that enhanced recovery, especially during the most difficult duration, can perhaps reduce the worst impact to the network and improve the overall resilience and the recovery time.
There are several directions for extensions of this work. The first is to utilize spatial and network variables in the non-stationary model. Temporal resilience can then be extended to measure spatiotemporal characteristics. Different time scales may need to be considered to account for the impacts from a system structure. Such extensions are natural as our model is derived from bottom-up starting with nodes at certain geo- and system- locations. Our preliminary work shows a step towards such an extension  . The second is to characterize the impacts of failures and recoveries to customers. This may involve more complex models beyond aggregated non-stationary random processes and Transient Little’s Law.
The authors would like to thank Chris Kung, Jae Won Choi, Daniel Burnham and Xinyu Dai for data processing, Kurt Belgum for helpful comments on the manuscript, Anthony Kuh, Vince Poor, and Nikil Jayant for helpful discussions. The support from the National Science Foundation (ECCS 0952785) is gratefully acknowledged.
1. Proof of Corollary 1
Proof: We begin with the Transient Little’s Theorem. Computing the derivative of both sides of Equation (7), we have
where is the cumulative distribution function of failure duration. Change the order of derivative and integral, we have
The first term on the right-hand-side is. By definition of, and, the second term is, i.e., the recovery rate (Equation (10)).
2. Pearson’s Hypothesis Test
Pearson’s Hypothesis Test: The hypothesis test is based on a chi-square statistic which compares the failure occurrence times with their sample mean. The details of testing that failure occurrence times are drawn from a NHPP are given here.
1) Compute the estimated failure rate. At time t, count the number of failures in,
2) Divide the failure occurrence times into m non-overlapping intervals. Count the number of failure occurrences in each interval. Let denote the number of occurrence in interval i, where. Let.
3) Count, which is the observed number of intervals with j failure occurrences, , where.
4) Use the estimated to compute, which is the expected number of intervals with j failure
5) Compute the sum. is a chi-square statistic, with degree of freedom dof = k −
(number of independent parameter fitted) − 1. Since one parameter is fitted, the degree of freedom is.
6) Given a confidence level, for instance 95%, we obtain a threshold value. The hypothesis is rejected if; otherwise, cannot be rejected.