 OJS  Vol.10 No.5 , October 2020
Consistency of the φ-Divergence Based Change Point Estimator
Abstract: This paper studies a change-point estimator based on the φ-divergence. Because the assumption of a stationary model is too restrictive, especially for long time series, the locations of parameter changes within a finite set of data have to be accounted for if the model is to reflect reality closely. The estimator is shown to be consistent through asymptotic theory, and this result is supported by simulations. The estimator is applied to the generalized Pareto distribution to estimate changes in the scale and shape parameters.

1. Introduction

Let $x_1, \ldots, x_n$ be a time series of size $n$. Methods in the literature typically assume stationary models when explaining the underlying data-generating process. However, stationarity is arguably a very strong assumption in many real-world applications, as process characteristics evolve over time. The literature shows that a single model may not be appropriate for a non-stationary series, and various change-point estimation methods have therefore been proposed. These methods are limited in different ways, and their suitability depends on the underlying assumptions. Statistical research has shown that, over time, underlying data-generating processes undergo occasional sudden changes [1]. A change point is said to occur when there exists a time $\tau \in \{1, \ldots, n-1\}$ such that the statistical properties of $x_1, \ldots, x_\tau$ and $x_{\tau+1}, \ldots, x_n$ are different. In its simplest form, change-point detection is the name given to the problem of estimating the point at which the statistical properties of a sequence of observations change [2]. The overall behavior of observations can change over time due to internal systemic changes in distribution dynamics or due to external factors. Time series data entail changes in the dependence structure, so modelling non-stationary processes with stationary methods will most likely yield a crude approximation, because abrupt changes are not accounted for [3]. Each change point is an integer between 1 and $n-1$ inclusive. The process $X$ is assumed to be piecewise stationary, meaning that some characteristics of the process change abruptly at unknown points in time. The corresponding segments are then homogeneous within themselves, but successive segments differ in their characteristics. For a parametric model, the parameters associated with the $i$th segment, denoted $\theta_i$, are assumed to change from one segment to the next.
Parametric tests for change point are mainly based on the likelihood ratio statistics and estimation based on the maximum likelihood method whose general results can be found in [4].

Detection of change points is critical to statistical inference, as a near perfect translation to reality is sought through model selection and parameter estimation. Parametric methods assume models for a given set of empirical data. Within a parametric setting, change points can be attributed to changes in the parameters of the underlying data distribution. Generally, change point methods can be compared on characteristics such as test size, power of the test, or the rate of convergence in estimating the correct number of change points and their locations. Change point problems can be classified as off-line, which deals with a fixed sample, or on-line, which considers new information as it is observed. In off-line problems, a sample of fixed size is first observed, and detection and estimation of change points are then carried out. [5] introduced the change point problem within the off-line setting. Since this pioneering work, methodologies for change point detection have been widely researched, with methods extending to techniques for higher order moments within time series data. Ideally, one wishes to test how many change points are present within a given set of data and to estimate the parameters associated with each segment. If $\tau$ is known, then the two samples only need to be compared. However, if $\tau$ is unknown, it has to be analyzed through change point analysis, which entails both detection and estimation of the change point (change time). The null hypothesis of no change is then tested against the alternative that there exists a time at which the distributional characteristics of the series changed. Stationarity in the strict sense implies time-invariance of the distribution underlying the process.

The hypotheses would be stated as:

$$H_0: F(x;\theta) = F(x;\theta_1) \ \text{ for } t = 1, \ldots, n \qquad H_1: F(x;\theta) = \begin{cases} F(x;\theta_1) & \text{for } t = 1, \ldots, \tau \\ F(x;\theta_2) & \text{for } t = \tau+1, \ldots, n \end{cases} \tag{1}$$

The null hypothesis postulates that the distribution remains unchanged throughout the sample of size $n$, whereas the alternative postulates that the distribution is as under the null up to time $\tau$, after which it changes. The change point problem is then to test the following hypotheses about the population parameter(s):

$$H_0: \theta_1 = \theta_2 = \cdots = \theta_n \quad \text{versus} \quad H_1: \theta_1 = \cdots = \theta_\tau \neq \theta_{\tau+1} = \cdots = \theta_n \tag{2}$$

where $\tau$ is unknown and needs to be estimated. If $\tau < n$ then the process distribution has changed and $\tau$ is referred to as the change point. We assume that there exists $\lambda \in [0, 1]$ such that $\tau$ satisfies

$$\tau = \lambda n \tag{3}$$

where $n$ is the number of observations in a given data set. Then hypothesis (2) can be restated as

$$H_0: \tau = n \ (\lambda = 1) \qquad H_1: \tau < n \ (0 < \lambda < 1) \tag{4}$$

At a given level of significance, if the null hypothesis is rejected, then the process X is said to be locally piecewise-stationary and can be approximated by a sequence of stationary processes that may share certain features such as the general functional form of the distribution F. Many authors such as [6] - [11] have considered both parametric and non-parametric methods of change point detection in time series data. Ideally, change points cannot be assumed to be known in advance hence the need for various methods of detection and estimation.

This paper is organized as follows: Section 2 gives an overview of the change point estimator based on a pseudo-distance measure. Section 3 provides key results for consistency of the estimator. Section 4 applies the change point estimator to the shape and scale parameters of the generalized Pareto distribution. Section 5 reports a simulation study that illustrates the consistency of the estimator. Finally, Section 6 provides concluding remarks.

2. Change Point Estimator

The change point problem is addressed by using a “distance” function between distributions to describe the change. Given a distance function, a test statistic is constructed to guarantee a distance $> \epsilon$ ($\epsilon \geq 0$) between any two distributions based on a sample of size $n$. Consider a given parametric model $\{f_\theta : \theta \in \Theta\}$, where $\Theta$ is the parameter space, defined on a data set of size $n$. Let $X_1, \ldots, X_n$ be random variables with probability densities $f(x;\theta_1), \ldots, f(x;\theta_n)$ with respect to a $\sigma$-finite measure $\mu$, with $F(x;\theta)$ generating distinct measures for $\theta \in \Theta$.

Definition 2.1 ($\phi$-divergence). Let $F_{\theta_1}$ and $F_{\theta_2}$ be two probability distributions. Define the $\phi$-divergence between the two distributions as

$$D_\phi(F_{\theta_1}, F_{\theta_2}) = D_\phi(\theta_1, \theta_2)$$

The broader family of $\phi$-divergences takes the general form

$$D_\phi(\theta_1, \theta_2) = \int \phi\left(\frac{dF_{\theta_1}}{dF_{\theta_2}}\right) dF_{\theta_2} = \int f_{\theta_2}(x)\, \phi\left(\frac{f_{\theta_1}(x)}{f_{\theta_2}(x)}\right) d\mu(x) = E_{\theta_2}\left[\phi\left(\frac{f_{\theta_1}(x)}{f_{\theta_2}(x)}\right)\right], \quad \phi \in \Phi \tag{5}$$

where $\Phi$ is the class of all convex functions $\phi(t)$, $t > 0$, satisfying $\phi(1) = 0$ and $\phi''(1) > 0$.

Assumption 1. The function $\phi \in \Phi : [0, \infty) \to (-\infty, +\infty)$ is convex and continuous. Its restriction to $[0, \infty)$ is finite and twice continuously differentiable, with $\phi(1) = \phi'(1) = 0$ and $\phi''(1) = 1$.

At the point $t = 0$, to avoid indeterminate expressions, [12] gives the following conventions for the functions $\phi$ involved in the general definition of $\phi$-divergence statistics:

$$0\,\phi\left(\frac{0}{0}\right) = 0, \qquad 0\,\phi\left(\frac{p}{0}\right) = \lim_{u \to \infty} \frac{\phi(u)}{u} \tag{6}$$

These assumptions ensure the existence of the integrals. Different choices of $\phi$ result in many divergences that play important roles in statistics, including the Kullback-Leibler divergence, $\phi(t) = -\ln(t)$, and the total variation, $\phi(t) = |t - 1|$, among others. In general $D_\phi(\theta_1, \theta_2) \neq D_\phi(\theta_2, \theta_1)$, so divergence measures are not distance measures; they quantify a difference between two probability measures, hence the term “pseudo-distance”. More generally, a divergence measure is a function of two probability density (or distribution) functions that takes non-negative values and equals zero only when the two arguments (distributions) are the same. A divergence measure grows larger as the two distributions move further apart. Hence, a large divergence implies departure from the null hypothesis.
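The asymmetry of a divergence can be seen on a small discrete example. The sketch below is illustrative only: it assumes finite probability vectors with strictly positive entries in place of densities, takes $\phi(t) = -\ln(t)$ for the Kullback-Leibler case, and uses hypothetical helper names (`phi_divergence`, `neg_log`, `abs_dev`) that do not come from the paper.

```python
import math

def phi_divergence(p, q, phi):
    """Discrete analogue of (5): D_phi(P, Q) = sum_i q_i * phi(p_i / q_i).

    Assumes p and q are probability vectors of equal length with all
    entries strictly positive.
    """
    return sum(qi * phi(pi / qi) for pi, qi in zip(p, q))

def neg_log(t):
    """phi(t) = -ln(t); gives sum_i q_i * ln(q_i / p_i), i.e. KL(Q || P)."""
    return -math.log(t)

def abs_dev(t):
    """phi(t) = |t - 1|; gives sum_i |p_i - q_i| (total variation form)."""
    return abs(t - 1.0)

p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = phi_divergence(p, q, neg_log)  # D_phi(P, Q)
kl_qp = phi_divergence(q, p, neg_log)  # D_phi(Q, P), arguments swapped
tv_pq = phi_divergence(p, q, abs_dev)
tv_qp = phi_divergence(q, p, abs_dev)
```

With these inputs the two Kullback-Leibler values differ when the arguments are swapped, whereas the total variation form gives the same value in both directions. Both vanish when the two arguments coincide.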

Generally, a change point problem’s objective would be to propose an estimator for the possible change-point τ given a set of random variables.

Based on the divergence in (5), a change point estimator can be constructed as

$$D_n^\tau = \max_{1 < \tau < n} \left(\lambda(1-\lambda)\right) \frac{2}{\phi''(1)}\, D_\phi(\hat{\theta}_1, \hat{\theta}_2) \tag{7}$$

where $\lambda = \tau/n \in \Lambda$ with $\Lambda = [0, 1]$, and $\hat{\theta}_1$, $\hat{\theta}_2$ are the maximum likelihood estimates of the parameters before and after the change point.

To test for a possible change in the distribution of $x_1, \ldots, x_n$, it is natural to compare the distribution function of the first $\tau$ observations with that of the last $(n - \tau)$, since the location of the change time is unknown. When $\tau$ is near the boundary, say near 1 or near $n$, an estimate calculated from a large number of observations $(n - \tau)$ must be compared with an estimate calculated from a small number of observations $\tau$. This may result in erratic behavior of the test statistic [7] due to instability of the parameter estimators. If $\lambda$ is not bounded away from zero and one, then the test statistic does not converge in distribution, i.e. the critical values required to obtain a sequence of level $\alpha$ tests diverge to infinity as $n \to \infty$ [13]. However, when $\lambda$ is bounded away from zero and one, fixed critical values can be obtained for increasing sample sizes, and this yields significant power gains if the change point is in $\Lambda$.

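Let $\epsilon > 0$ be small enough such that $\lambda \in (\epsilon, 1 - \epsilon)$.

Suppose that $\lambda$ maximizes the test statistic over $[0, 1]$. Then, under the null hypothesis,

$$\sup_{\lambda \in (\epsilon, 1-\epsilon)} D(\lambda) = O_p(1) \ \ \forall\, \epsilon, \qquad \sup_{\lambda \in [0, 1]} D(\lambda) \to \infty \ \text{ as } n \to \infty \tag{8}$$

(see [13]). By this result, and for $N(\epsilon) = \{\epsilon n, \ldots, (1-\epsilon) n\}$, the test statistic becomes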

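$$D_n^\tau = \max_{\tau \in N(\epsilon)} \left(\frac{\tau}{n}\left(1 - \frac{\tau}{n}\right)\right) \frac{2}{\phi''(1)}\, D_\phi(\hat{\theta}_1, \hat{\theta}_2) \tag{9}$$

The change-point estimator $\hat{\tau}$ of a change point $\tau$ is the point at which there is maximal sample evidence for a change in distributional parameters, characterized by maximum divergence. It is estimated by the least value of $\tau$ that maximizes the test statistic (9):

$$\hat{\tau} = \min\left\{\tau : D_n^\tau = \max_{\tau \in N(\epsilon)} \left(\frac{\tau}{n}\left(1 - \frac{\tau}{n}\right)\right) \frac{2}{\phi''(1)}\, D_\phi(\hat{\theta}_1, \hat{\theta}_2)\right\} \tag{10}$$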
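The scan over $\tau \in N(\epsilon)$ can be sketched in a few lines. This is a minimal illustration under simplifying assumptions that are not the paper's setting: each segment is taken to be exponential (so the maximum likelihood estimate of the scale is the segment mean and the Kullback-Leibler divergence has a closed form), $\phi(t) = -\log(t)$ so that $\phi''(1) = 1$, and all function names are hypothetical.

```python
import math
import random

def kl_exponential(scale_a, scale_b):
    """KL(Exp(scale_a) || Exp(scale_b)) for exponential laws with the given means."""
    return math.log(scale_b / scale_a) + scale_a / scale_b - 1.0

def estimate_change_point(x, eps=0.1):
    """Scan tau over N(eps) and return (tau_hat, statistic) in the spirit of (9)-(10).

    Assumption: exponential segments, so each MLE is a segment mean, and
    D_phi(theta1, theta2) with phi(t) = -log(t) equals KL(f_theta2 || f_theta1).
    """
    n = len(x)
    lo = max(1, math.ceil(eps * n))
    hi = min(n - 1, math.floor((1.0 - eps) * n))
    prefix = [0.0]                      # prefix sums give segment means in O(1)
    for value in x:
        prefix.append(prefix[-1] + value)
    best_tau, best_stat = None, -math.inf
    for tau in range(lo, hi + 1):
        scale1 = prefix[tau] / tau                      # MLE before tau
        scale2 = (prefix[n] - prefix[tau]) / (n - tau)  # MLE after tau
        lam = tau / n
        stat = lam * (1.0 - lam) * 2.0 * kl_exponential(scale2, scale1)
        if stat > best_stat:            # strict '>' keeps the smallest maximiser
            best_tau, best_stat = tau, stat
    return best_tau, best_stat

def simulate_exponential_change(n, tau, scale1, scale2, seed=0):
    """Draw tau values with mean scale1 followed by n - tau values with mean scale2."""
    rng = random.Random(seed)
    return ([rng.expovariate(1.0 / scale1) for _ in range(tau)]
            + [rng.expovariate(1.0 / scale2) for _ in range(n - tau)])

x = simulate_exponential_change(1000, 500, 1.0, 5.0, seed=1)
tau_hat, statistic = estimate_change_point(x, eps=0.1)
```

The strict comparison in the loop mirrors the "least value of $\tau$" rule in (10). Replacing `kl_exponential` and the segment means with a divergence and estimator for another parametric family would follow the same pattern.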

3. Consistency of the Change Point Estimator

A minimal requirement for a good statistical decision rule is its increasing reliability with increasing sample sizes [14].

Let $x_1, \ldots, x_n$ be a sample of fixed size $n$ with density function $f(x;\theta)$ for $\theta \in \Theta \subseteq \mathbb{R}^d$, and let $L(x;\theta)$ be the likelihood function. It can be shown by Taylor’s theorem that, under the null hypothesis, the $\phi$-divergence based estimator can be reduced to a two-sample Wald-type test statistic of the form

W n τ ^ = max τ N ( ϵ ) ( τ n ( 1 τ n ) ) ( θ 1 ^ θ 1 ) I ( θ 0 ) ( θ 2 ^ θ 2 ) (11)

Suppose $x_1, \ldots, x_n$ is a sample of $n$ iid random variables with probability density function $f(x;\theta)$, where $\theta = (\theta_1, \ldots, \theta_k)$, $k < n$, is the vector of parameters governing the pdf. The likelihood function can be expressed as

$$L(\theta \mid x) = \prod_{i=1}^{n} f(x_i;\theta) \tag{12}$$

It is more convenient to work with the logarithm of the likelihood function, given by

$$l(\theta \mid x) = \sum_{i=1}^{n} \log f(x_i;\theta) \tag{13}$$

Since the logarithm is a monotone increasing function, maximizing the likelihood function is equivalent to maximizing the log-likelihood function. Introduce the following notations:

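$$\nabla_\theta \log f(x_i;\theta) = \frac{\partial}{\partial\theta} \log f(x_i;\theta) \tag{14}$$

$$\nabla_\theta^2 \log f(x_i;\theta) = \frac{\partial^2}{\partial\theta_i\, \partial\theta_j} \log f(x_i;\theta) \tag{15}$$

$$H_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^2 \log f(x_i;\theta) \tag{16}$$

$$U_{j,m}(\theta) = \sum_{i=j}^{m} \nabla_\theta \log f(x_i;\theta), \quad 1 \leq j \leq m \leq n \tag{17}$$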
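As a hedged illustration of this notation, consider the exponential density $f(x;\sigma) = \sigma^{-1} e^{-x/\sigma}$, $x \geq 0$. This one-parameter model is an assumption made here only to keep the algebra short; it is not the case analysed in the paper.

```latex
\begin{aligned}
\nabla_\sigma \log f(x_i;\sigma) &= -\frac{1}{\sigma} + \frac{x_i}{\sigma^{2}}, \\
\nabla_\sigma^{2} \log f(x_i;\sigma) &= \frac{1}{\sigma^{2}} - \frac{2 x_i}{\sigma^{3}}, \\
U_{1,n}(\sigma) &= -\frac{n}{\sigma} + \frac{1}{\sigma^{2}} \sum_{i=1}^{n} x_i, \\
H_n(\sigma) &= \frac{1}{\sigma^{2}} - \frac{2 \bar{x}_n}{\sigma^{3}},
\qquad \bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i .
\end{aligned}
```

Setting $U_{1,n}(\sigma) = 0$ gives the maximum likelihood estimate $\hat{\sigma} = \bar{x}_n$. Taking expectations with $E[X] = \sigma$ gives $E[\nabla_\sigma^2 \log f(X;\sigma)] = -1/\sigma^2$, so the Fisher information per observation in this example is $I(\sigma) = 1/\sigma^2$.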

The following limits hold as $n \to \infty$.

H n ( θ ) I ( θ ) H n ( θ ) + 1 n I ( θ ) 0 (18)

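On the assumption that $\theta_1 \neq \theta_2$ for $\theta_1, \theta_2 \in \Theta \subseteq \mathbb{R}^d$,

$$\hat{\theta}_1 \to \theta_1, \quad \hat{\theta}_2 \to \theta_2 \quad \text{as } n \to \infty, \tag{19}$$

where $\hat{\theta}_1$ and $\hat{\theta}_2$ are solutions to $\sum_{i=1}^{\tau} \nabla_\theta \log f(x_i;\theta) = 0$ and $\sum_{i=\tau+1}^{n} \nabla_\theta \log f(x_i;\theta) = 0$ respectively.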

Theorem 3.1. Let 0 < δ 1 < δ 2 < , n 1 = n δ 1 , n 2 = n δ 2

lim τ max { τ 1 2 U τ ( θ 0 ) : n 1 < τ < n 2 } = O p ( 1 ) (20)

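Theorem 3.2. Let $0 < \epsilon < 1 - \epsilon$ for $\epsilon > 0$ small enough. Then as $n \to \infty$

$$\max\left\{\tau^{-1/2} \left\| I(\theta_0)^{-1} U_n^\tau(\theta_0) \right\| : \tau \in N(\epsilon)\right\} = O_p(1) \tag{21}$$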

For the proof of theorems 3.1 and 3.2 see [15].

Theorem 3.3. Let 0 < δ 1 < δ 2 < and n 1 = n δ 1 , n 2 = n δ 2 For n 1

lim n { max [ n 1 / 2 ( θ n ^ θ 0 ) 1 n I ( θ 0 ) 1 U n ( θ 0 ) : n 1 < n 2 ] } 0 (22)

Proof

0 = U n ( θ n ^ ) = U n ( θ 0 ) + ( θ n ^ θ 0 ) U n ( θ 0 ) + 1 2 ( θ n ^ θ 0 ) 2 U n ( θ 0 ) = U n ( θ 0 ) n H n ( θ 0 ) + R ˜ = n 1 2 U n ( θ 0 ) n H n ( θ 0 ) n 1 2 ( θ n ^ θ 0 ) + n 1 2 R ˜ (23)

The third term on the RHS is o p ( 1 ) [14]. By definition of MLE U n ( θ n ^ ) = 0 .

U n ( θ 0 ) = n H n ( θ n ) ( θ n ^ θ 0 ) n 1 2 U n ( θ 0 ) = n 1 2 H n ( θ 0 ) ( θ n ^ θ 0 ) (24)

From Equation (24) we obtain

n 1 2 ( θ n ^ θ 0 ) = n 1 2 H n ( θ 0 ) 1 U n ( θ 0 )

Hence

n 1 2 ( θ n ^ θ 0 ) 1 n I ( θ 0 ) 1 U n ( θ 0 ) H n ( θ 0 ) 1 I ( θ 0 ) n 1 2 U n ( θ 0 ) (25)

But by Equation (18)

H n ( θ 0 ) 1 I ( θ 0 ) 0

By theorem 3.1 n 1 2 U n ( θ 0 ) is bounded in probability. Hence the proof.

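Theorem 3.4. Let $0 < \epsilon < 1 - \epsilon$ for $\epsilon > 0$ small enough. Then

$$\lim_{n \to \infty} \left\{ \max\left[ \tau^{1/2} \left\| (\hat{\theta}_1 - \hat{\theta}_2) - \frac{n}{\tau(n-\tau)} I(\theta_0)^{-1} U_n^\tau(\theta_0) \right\| : \tau \in N(\epsilon) \right] \right\} = 0 \tag{26}$$

Proof

$$\begin{aligned} (\hat{\theta}_1 - \hat{\theta}_2) - \frac{n}{\tau(n-\tau)} I(\theta_0)^{-1} U_n^\tau(\theta_0) &= \left(\hat{\theta}_1 - \theta_0 - \frac{1}{\tau} I(\theta_0)^{-1} U_{1,\tau}(\theta_0)\right) - \left(\hat{\theta}_2 - \theta_0 - \frac{1}{n-\tau} I(\theta_0)^{-1} U_{\tau+1,n}(\theta_0)\right) \\ \max \tau^{1/2} \left\| (\hat{\theta}_1 - \hat{\theta}_2) - \frac{n}{\tau(n-\tau)} I(\theta_0)^{-1} U_n^\tau(\theta_0) \right\| &\leq \max \tau^{1/2} \left\| \hat{\theta}_1 - \theta_0 - \frac{1}{\tau} I(\theta_0)^{-1} U_{1,\tau}(\theta_0) \right\| + \max \tau^{1/2} \left\| \hat{\theta}_2 - \theta_0 - \frac{1}{n-\tau} I(\theta_0)^{-1} U_{\tau+1,n}(\theta_0) \right\| \end{aligned} \tag{27}$$

Consider the terms on the RHS. By Theorem 3.3, as $\tau, n \to \infty$,

$$\max \tau^{1/2} \left\| \hat{\theta}_1 - \theta_0 - \frac{1}{\tau} I(\theta_0)^{-1} U_{1,\tau}(\theta_0) \right\| \to 0, \qquad \max \tau^{1/2} \left\| \hat{\theta}_2 - \theta_0 - \frac{1}{n-\tau} I(\theta_0)^{-1} U_{\tau+1,n}(\theta_0) \right\| \to 0$$

Hence the proof.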

Assume that within a finite set of data a change point $\tau$ exists and that $n \to \infty$ such that $\tau, (n - \tau) \to \infty$.

Define

$$U_n^\tau(\theta_0) = U_{1,\tau}(\theta_0) - \frac{\tau}{n} U_{1,n}(\theta_0)$$

Consider the following two sample homogeneity test

Q n τ = n τ ( n τ ) U n τ ( θ 0 ) I ( θ 0 ) U n τ ( θ 0 ) for 1 < τ < n (28)

[15] defined a consistent estimate of 28 as

Q n τ ^ = n τ ( n τ ) U τ ( θ n ^ ) ( H n ( θ n ^ ) ) 1 U τ ( θ n ^ ) (29)

By the principles of maximum likelihood estimation, $Q_n^n = 0$ since $U_n(\hat{\theta}_n) = 0$, and $Q_n^0 = 0$ since $U_0(\cdot) = 0$.

Consider U τ ( θ n ^ ) . By Taylor’s theorem,

U τ ( θ n ^ ) = U τ ( θ τ ) + ( θ n ^ θ τ ^ ) U τ ( θ τ ) U τ ( θ τ ) = i = 1 τ 2 θ τ 2 log f ( x ; θ τ ) = ( θ n ^ θ τ ^ ) i = 1 τ 2 θ τ 2 log f ( x ; θ τ ) = τ H τ ( θ τ ^ ) ( θ n ^ θ τ ^ ) (30)

Since by the principle of maximum likelihood estimation U τ ( θ τ ) = 0 .

U τ ( θ n ^ ) = τ H τ ( θ τ ^ ) ( θ τ ^ θ n ^ ) (31)

( θ τ ^ θ n ^ ) = τ { H τ ( θ τ ^ ) } 1 U τ ( θ n ^ ) = τ { H τ ( θ τ ^ ) } 1 { U 1 τ ( θ 0 ^ ) τ n U 1 n ( θ 0 ^ ) } = τ { H τ ( θ τ ^ ) } 1 { 1 τ U 1 n ( θ 0 ^ ) 1 n τ U 1 n ( θ 0 ^ ) } = τ { H τ ( θ τ ^ ) } 1 { 1 n τ U 1 n ( θ 0 ^ ) 1 τ U 1 n ( θ 0 ^ ) } = n τ n τ { H τ ( θ τ ^ ) } 1 U 1 n ( θ 0 ^ ) = n τ n τ I ( θ τ ̂ ) U 1 n ( θ 0 ^ ) (32)

By the CLT,

$$(\hat{\theta}_\tau - \hat{\theta}_n) \sim N\left(0, \frac{n-\tau}{n\tau} I(\theta_\tau)^{-1}\right)$$

and thus $(\hat{\theta}_\tau - \hat{\theta}_n)$ has squared Mahalanobis norm

$$(\hat{\theta}_\tau - \hat{\theta}_n)^\top \left(\frac{n-\tau}{n\tau} I(\theta_\tau)^{-1}\right)^{-1} (\hat{\theta}_\tau - \hat{\theta}_n) \tag{33}$$

Hence

$$(\hat{\theta}_\tau - \hat{\theta}_n)^\top \left(\frac{n-\tau}{n\tau} I(\theta_\tau)^{-1}\right)^{-1} (\hat{\theta}_\tau - \hat{\theta}_n) \approx \hat{Q}_n^\tau \tag{34}$$

implying that $Q_n^\tau$ is approximately equal to the squared Mahalanobis norm of $(\hat{\theta}_\tau - \hat{\theta}_n)$. The Mahalanobis norm can be used to detect change points within a given finite time series [11]. Since the statistic (33) quantifies the difference between $\hat{\theta}_n$ and $\hat{\theta}_\tau$, $\hat{Q}_n^\tau$ can similarly be used to quantify the deviation between the two parameter estimates. The value of $\hat{Q}_n^\tau$ grows larger under the alternative hypothesis and tends towards zero when the null hypothesis is true. Suppose we define a maximal type test statistic in terms of $\hat{Q}_n^\tau(t)$ such that

$$\hat{Q}_n^\tau = \max\left\{\hat{Q}_n^\tau(t) : t \in N(\epsilon)\right\} \tag{35}$$

then we obtain a measure of the largest difference between $\hat{\theta}_n$ and $\hat{\theta}_\tau$. Consider now the divergence based estimator, which was reduced to a two-sample test statistic in Equation (11).
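The squared Mahalanobis norm with covariance $\frac{n-\tau}{n\tau} I^{-1}$ is the same as the quadratic form in $I$ scaled by $\frac{n\tau}{n-\tau}$. The sketch below checks this numerically for a two-parameter case. The information matrix and the parameter difference used are made-up illustrative values, and the helper names are hypothetical.

```python
def quadratic_form(v, m):
    """Return v^T m v for a 2-vector v and a 2x2 matrix m."""
    return (v[0] * (m[0][0] * v[0] + m[0][1] * v[1])
            + v[1] * (m[1][0] * v[0] + m[1][1] * v[1]))

def inverse_2x2(m):
    """Invert a 2x2 matrix given as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def squared_mahalanobis(diff, info, tau, n):
    """diff^T C^{-1} diff with covariance C = ((n - tau) / (n * tau)) * info^{-1}."""
    scale = (n - tau) / (n * tau)
    cov = [[scale * entry for entry in row] for row in inverse_2x2(info)]
    return quadratic_form(diff, inverse_2x2(cov))

def scaled_information_form(diff, info, tau, n):
    """Equivalent shortcut: (n * tau / (n - tau)) * diff^T info diff."""
    return n * tau / (n - tau) * quadratic_form(diff, info)

# Illustrative (made-up) values: a 2x2 information matrix and a difference
# between two parameter estimates.
info = [[2.0, 0.5], [0.5, 1.0]]
diff = (0.1, -0.2)
via_covariance = squared_mahalanobis(diff, info, tau=50, n=200)
via_information = scaled_information_form(diff, info, tau=50, n=200)
```

Both routes give the same number, which is why the statistic can be computed from the information matrix directly without forming and inverting the covariance.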

Definition 3.1. A matrix $M$ is called positive definite if $x^\top M x \geq 0$ for all $x \in \mathbb{R}^n$, with equality if and only if $x = 0$. The following inequality holds,

$$x^\top M x - y^\top M y \leq \|M\| \left\{ \|x - y\|^2 + 2\|y\|\,\|x - y\| \right\} \tag{36}$$
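The inequality can be verified directly. The short derivation below assumes that $M$ is symmetric and that $\|M\|$ denotes the operator norm induced by the Euclidean vector norm; neither assumption is stated explicitly in the text.

```latex
\begin{aligned}
x^{\top} M x - y^{\top} M y
  &= (x - y)^{\top} M (x - y) + 2\, y^{\top} M (x - y), \\
\left| x^{\top} M x - y^{\top} M y \right|
  &\leq \left| (x - y)^{\top} M (x - y) \right| + 2 \left| y^{\top} M (x - y) \right| \\
  &\leq \|M\| \, \|x - y\|^{2} + 2\, \|M\| \, \|y\| \, \|x - y\| .
\end{aligned}
```

The first line follows by expanding $(x-y)^\top M (x-y)$ and using $x^\top M y = y^\top M x$; the last line uses the Cauchy-Schwarz inequality together with $\|Mv\| \leq \|M\|\,\|v\|$.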

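Consider the following result

$$\left| W_n^\tau - Q_n^\tau \right| = \left| \frac{\tau(n-\tau)}{n} (\hat{\theta}_1 - \hat{\theta}_2)^\top I(\theta_0) (\hat{\theta}_1 - \hat{\theta}_2) - \left\{ \frac{n}{\tau(n-\tau)} I(\theta_0)^{-1} U_n^\tau(\theta_0) \right\}^\top I(\theta_0) \left\{ \frac{n}{\tau(n-\tau)} I(\theta_0)^{-1} U_n^\tau(\theta_0) \right\} \right| \tag{37}$$

By inequality (36),

$$\left| W_n^\tau - Q_n^\tau \right| < \left\| I(\theta_0) \right\| \left\{ \left\| (\hat{\theta}_1 - \hat{\theta}_2) - \frac{n}{\tau(n-\tau)} I(\theta_0)^{-1} U_n^\tau(\theta_0) \right\|^2 + 2 \left\| \frac{n}{\tau(n-\tau)} I(\theta_0)^{-1} U_n^\tau(\theta_0) \right\| \left\| (\hat{\theta}_1 - \hat{\theta}_2) - \frac{n}{\tau(n-\tau)} I(\theta_0)^{-1} U_n^\tau(\theta_0) \right\| \right\} \tag{38}$$

Consider the last factor on the RHS. By the result of Theorem 3.4,

$$\left\| (\hat{\theta}_1 - \hat{\theta}_2) - \frac{n}{\tau(n-\tau)} I(\theta_0)^{-1} U_n^\tau(\theta_0) \right\| \to 0 \tag{39}$$

and hence

$$\left\| (\hat{\theta}_1 - \hat{\theta}_2) - \frac{n}{\tau(n-\tau)} I(\theta_0)^{-1} U_n^\tau(\theta_0) \right\|^2 \to 0 \tag{40}$$

Consider the remaining factor of the second term on the RHS. By the results in Theorem 3.2,

$$I(\theta_0)^{-1} U_n^\tau(\theta_0) = O_p(1) \tag{41}$$

From these results, as $\tau, n \to \infty$,

$$\left| W_n^\tau - Q_n^\tau \right| \to 0 \tag{42}$$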

Definition 3.2. (Asymptotic consistency). A change point detection algorithm is said to be asymptotically consistent if the estimated segmentation is such that

$$\max \left| \frac{\hat{\tau}}{n} - \frac{\tau}{n} \right| \to 0 \tag{43}$$

It is the change point fractions that are consistent, not the indexes themselves. Consistency results in the literature deal only with change point fractions, since the distances $|\hat{\tau} - \tau|$ and their estimated counterparts do not converge to zero [11].

4. Change Point Analysis in the Generalized Pareto Distribution

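Definition 4.1. The generalized Pareto distribution function is defined by

$$H(x) = \begin{cases} 1 - \left(1 + \dfrac{\xi x}{\sigma}\right)^{-1/\xi}, & \xi \neq 0 \\ 1 - \exp\left(-\dfrac{x}{\sigma}\right), & \xi = 0 \end{cases} \tag{44}$$

where

$$x \in \begin{cases} [0, \infty), & \xi \geq 0 \\ \left[0, -\dfrac{\sigma}{\xi}\right], & \xi < 0 \end{cases}$$

The scale parameter $\sigma$ characterizes the spread of the distribution, and $\xi$, referred to as the tail index or shape parameter, determines the tail thickness. More specifically, given that $X \sim GP(\sigma, \xi)$, the probability density function is

$$h(x) = \begin{cases} \dfrac{1}{\sigma}\left(1 + \dfrac{\xi x}{\sigma}\right)^{-\frac{1}{\xi} - 1}, & \xi \neq 0 \\ \dfrac{1}{\sigma} \exp\left(-\dfrac{x}{\sigma}\right), & \xi = 0 \end{cases} \tag{45}$$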
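The distribution function, the density, and the quantile function obtained by inverting the distribution function can be written as a few small helpers. This is a minimal sketch using only the standard library; the function names are hypothetical, and the argument order `(sigma, xi)` follows the $GP(\sigma, \xi)$ notation.

```python
import math

def gpd_cdf(x, sigma, xi):
    """Distribution function H(x) of GP(sigma, xi) as in (44)."""
    if x < 0:
        return 0.0
    if xi == 0:
        return 1.0 - math.exp(-x / sigma)
    z = 1.0 + xi * x / sigma
    if z <= 0:                 # beyond the upper endpoint -sigma/xi when xi < 0
        return 1.0
    return 1.0 - z ** (-1.0 / xi)

def gpd_pdf(x, sigma, xi):
    """Density h(x) of GP(sigma, xi) as in (45)."""
    if x < 0:
        return 0.0
    if xi == 0:
        return math.exp(-x / sigma) / sigma
    z = 1.0 + xi * x / sigma
    if z <= 0:
        return 0.0
    return z ** (-1.0 / xi - 1.0) / sigma

def gpd_quantile(u, sigma, xi):
    """Inverse of H: the x with H(x) = u, for 0 <= u < 1."""
    if not 0.0 <= u < 1.0:
        raise ValueError("u must lie in [0, 1)")
    if xi == 0:
        return -sigma * math.log1p(-u)
    return sigma / xi * ((1.0 - u) ** (-xi) - 1.0)
```

Applying `gpd_quantile` to uniform random numbers on $[0, 1)$ is one simple way to simulate generalized Pareto data (inverse transform sampling).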

For any given finite set of data, at least one of the following is likely at any given change point $\tau$ $(1 < \tau < n)$: $\xi$ changes by a non-zero quantity; $\sigma$ changes by a non-zero quantity; both $\xi$ and $\sigma$ change by non-zero quantities. A simple change point problem can be formulated in one of the following ways:

$$H_0: X_t \sim GP(\sigma_1, \xi_1) \quad \text{against} \quad H_1: \begin{cases} X_t \sim GP(\sigma_1, \xi_1), & t \leq \tau \\ X_t \sim GP(\sigma_2, \xi_2), & t > \tau \end{cases} \tag{46}$$

$$H_0: X_t \sim GP(\sigma_1, \xi_1) \quad \text{against} \quad H_1: \begin{cases} X_t \sim GP(\sigma_1, \xi_1), & t \leq \tau \\ X_t \sim GP(\sigma_1, \xi_2), & t > \tau \end{cases} \tag{47}$$

$$H_0: X_t \sim GP(\sigma_1, \xi_1) \quad \text{against} \quad H_1: \begin{cases} X_t \sim GP(\sigma_1, \xi_1), & t \leq \tau \\ X_t \sim GP(\sigma_2, \xi_1), & t > \tau \end{cases} \tag{48}$$

Since change points are unknown in advance, any of the three hypothesis formulations is possible. Without knowledge of the types of changes contained in the time series, the question arises as to which testing procedure to use. In most instances hypothesis (46) is tested, since it is assumed that both distributional parameters change.

Figure 1 shows different GP density plots with a constant scale parameter but varying shape parameters. Figure 2, on the other hand, shows different GP density plots with both scale and shape parameters varying. If either parameter were to change at a given point in time, the thickness of the tail of the distribution would change, and this would in turn affect the intensity of the extreme values observed.

Assume that $X$ is a sequence of independent random variables drawn from the generalized Pareto distribution, and consider a sample data set $x_1, \ldots, x_\tau, x_{\tau+1}, \ldots, x_n$ of fixed size $n$ $(n \geq 3)$. Say $f_{\theta_1}$ is governed by the parameter vector $\theta_1 = (\xi_1, \sigma_1)$ and $f_{\theta_2}$ is governed by the parameter vector $\theta_2 = (\xi_2, \sigma_2)$, where $\theta_1 \neq \theta_2$ and $\theta_1, \theta_2 \in \Theta$. The data set is assumed to contain an unknown change point $\tau$ where the distribution parameters $\xi$ and $\sigma$ abruptly change. Then

$$x_1, \ldots, x_\tau \sim f_{\theta_1}(x), \qquad x_{\tau+1}, \ldots, x_n \sim f_{\theta_2}(x)$$

The density function (49) then governs the first $\tau$ observations and (50) governs the last $(n - \tau)$ observations.

Figure 1. Density plot with constant scale.

Figure 2. Density plot with varying scale and shape.

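$$f_{\theta_1}(x) = \begin{cases} \dfrac{1}{\sigma_1}\left(1 + \dfrac{\xi_1 x}{\sigma_1}\right)^{-\frac{1}{\xi_1} - 1}, & \xi_1 \neq 0 \\ \dfrac{1}{\sigma_1} \exp\left(-\dfrac{x}{\sigma_1}\right), & \xi_1 = 0 \end{cases} \tag{49}$$

$$f_{\theta_2}(x) = \begin{cases} \dfrac{1}{\sigma_2}\left(1 + \dfrac{\xi_2 x}{\sigma_2}\right)^{-\frac{1}{\xi_2} - 1}, & \xi_2 \neq 0 \\ \dfrac{1}{\sigma_2} \exp\left(-\dfrac{x}{\sigma_2}\right), & \xi_2 = 0 \end{cases} \tag{50}$$

We restrict attention to the case $\xi > 0$, i.e. heavy tailed distributions, thereby considering only the first branch of the density function, with support $x \in [0, \infty)$.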

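From the divergence in Equation (5), let $\phi(t) = -\log(t)$. Then

$$\begin{aligned} D_\phi(\theta_1, \theta_2) &= \int f_{\theta_2}(x)\, \phi\left(\frac{f_{\theta_1}(x)}{f_{\theta_2}(x)}\right) d\mu(x) \\ &= -\int f_{\theta_2}(x) \log\left(\frac{f_{\theta_1}(x)}{f_{\theta_2}(x)}\right) d\mu(x) \\ &= \int f_{\theta_2}(x) \log\left(\frac{f_{\theta_2}(x)}{f_{\theta_1}(x)}\right) d\mu(x) \\ &= D_{KL}\left(f_{\theta_2}(x), f_{\theta_1}(x)\right) \equiv D_{KL}(\theta_2, \theta_1) \end{aligned} \tag{51}$$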

Applying properties of the generalized Pareto distribution [16], together with numerical computation and standard methods of integration, the divergence between two generalized Pareto distributions becomes

D K L ( θ 2 , θ 1 ) = log ( σ 2 σ 1 ) ( 1 + ξ 1 ) ( 1 ξ 2 + 1 ) σ 1 σ 2 ξ 1 ( 1 + ξ 1 σ 1 x ) 1 ξ 1 ( ξ 2 σ 2 + 1 ) 1 d x (52)

The divergence is a function of the parameters of the two densities.
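For a numerical check, the Kullback-Leibler divergence between two generalized Pareto densities can also be evaluated directly from its definition, $\int f_{\theta_2} \log(f_{\theta_2}/f_{\theta_1})\, d\mu$. The sketch below is not the paper's closed-form expression: it assumes $\xi_1, \xi_2 > 0$, rewrites the integral as an expectation over the quantile function of $f_{\theta_2}$, and approximates it with a midpoint rule. The function names and the number of grid points are illustrative choices.

```python
import math

def gpd_log_pdf(x, sigma, xi):
    """Log-density of GP(sigma, xi) for xi > 0 and x >= 0."""
    return -math.log(sigma) - (1.0 / xi + 1.0) * math.log1p(xi * x / sigma)

def gpd_quantile(u, sigma, xi):
    """Quantile function of GP(sigma, xi) for xi > 0 and 0 <= u < 1."""
    return sigma / xi * ((1.0 - u) ** (-xi) - 1.0)

def kl_gpd_numeric(theta2, theta1, steps=20000):
    """Approximate KL(f_theta2 || f_theta1), with theta = (sigma, xi).

    Uses E_{theta2}[g(X)] = integral over u in (0, 1) of g(Q_theta2(u)),
    evaluated with a midpoint rule on `steps` equal cells.
    """
    sigma2, xi2 = theta2
    sigma1, xi1 = theta1
    if min(xi1, xi2) <= 0 or min(sigma1, sigma2) <= 0:
        raise ValueError("this sketch assumes sigma > 0 and xi > 0")
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = gpd_quantile((i + 0.5) * h, sigma2, xi2)
        total += gpd_log_pdf(x, sigma2, xi2) - gpd_log_pdf(x, sigma1, xi1)
    return total * h

# Parameters of the simulation hypothesis: GP(1, 0.1) before and GP(3, 0.35) after.
kl_after_vs_before = kl_gpd_numeric((3.0, 0.35), (1.0, 0.1))
```

When $\xi_1/\sigma_1 = \xi_2/\sigma_2$ the integral can be worked out by hand, which gives a convenient check: for $\theta_2 = (3, 0.3)$ and $\theta_1 = (1, 0.1)$ the exact value is $\log(1/3) - 1.3 + 11 \times 0.3 \approx 0.9014$.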

5. Simulation Study

The performance of the estimator is examined by considering the effect of changing the sample size. The single change-point estimation problem is considered, where the change point $\tau$ is fixed at $n/2$ for $n = 200, 500, 1000$. Figures 3-5 display the location of the change point as estimated by the proposed estimator (10), with the divergence measure as in (52), for the various sample sizes. The hypothesis considered here is

$$H_0: X_t \sim GP(1, 0.1) \quad \text{against} \quad H_1: \begin{cases} X_t \sim GP(1, 0.1), & t \leq \tau \\ X_t \sim GP(3, 0.35), & t > \tau \end{cases} \tag{53}$$

Figure 3. Sample size = 200, $\tau = 100$, $\hat{\tau} = 88$.

Figure 4. Sample size = 500, $\tau = 250$, $\hat{\tau} = 245$.

Figure 5. Sample size = 1000, $\tau = 500$, $\hat{\tau} = 494$.
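Since consistency is stated in terms of change point fractions, it can help to convert the values reported in the captions of Figures 3-5 into the error $|\hat{\tau}/n - \tau/n|$. The short computation below uses only those reported values; the helper name is hypothetical.

```python
def fraction_error(n, tau, tau_hat):
    """Absolute error of the estimated change point fraction, |tau_hat/n - tau/n|."""
    return abs(tau_hat - tau) / n

# (n, tau, tau_hat) as reported in the captions of Figures 3-5.
reported = [(200, 100, 88), (500, 250, 245), (1000, 500, 494)]
errors = [fraction_error(n, tau, tau_hat) for n, tau, tau_hat in reported]
```

The resulting errors are 0.06, 0.01 and 0.006 for $n = 200, 500, 1000$, so the fraction error shrinks as the sample size grows in these three runs.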

To check the consistency of the estimator, we consider data simulated from the GP density with parameters $(1, \xi_1)$ and $(3, \xi_2)$ for the scale and shape respectively, before and after the change point. 1000 simulations are carried out to estimate the change point, and the results are given in Table 1 and Table 2.

Table 1. Effect of the sample size with varying scale and varying shape ($\tau = n/2$).

Table 2. Effect of the sample size with varying scale and varying shape ($\tau = n/3$).

6. Conclusion

In this paper, a divergence (pseudo-distance) based estimator is used to detect change points within a parametric framework, focusing on the generalized Pareto distribution. Change points are attributed to changes in the model parameters at unknown points in time, with the parameter values before and after the change point also unknown. The estimator is shown theoretically to be consistent, and the simulation studies support this conclusion.

Acknowledgements

The first author thanks the Pan-African University Institute of Basic Sciences, Technology and Innovation (PAUSTI) for funding this research.

Appendix

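Derivation of the change point estimator $W_n^\tau$

Consider a second order Taylor expansion of $D_\phi(\hat{\theta}_1, \hat{\theta}_2)$ about the true parameter values $\theta_1, \theta_2$.

For $i, j = 1, \ldots, d$,

$$\begin{aligned} D_\phi(\hat{\theta}_1, \hat{\theta}_2) ={}& D_\phi(\theta_1, \theta_2) + \sum_{i=1}^{d} \frac{\partial D_\phi(\theta_1, \theta_2)}{\partial \theta_{1i}} (\hat{\theta}_{1i} - \theta_{1i}) + \sum_{i=1}^{d} \frac{\partial D_\phi(\theta_1, \theta_2)}{\partial \theta_{2i}} (\hat{\theta}_{2i} - \theta_{2i}) \\ &+ \frac{1}{2} \sum_{i=1}^{d} \sum_{j=1}^{d} \frac{\partial^2 D_\phi(\theta_1, \theta_2)}{\partial \theta_{1i}\, \partial \theta_{1j}} (\hat{\theta}_{1i} - \theta_{1i})(\hat{\theta}_{1j} - \theta_{1j}) \\ &+ \frac{1}{2} \sum_{i=1}^{d} \sum_{j=1}^{d} \frac{\partial^2 D_\phi(\theta_1, \theta_2)}{\partial \theta_{2i}\, \partial \theta_{2j}} (\hat{\theta}_{2i} - \theta_{2i})(\hat{\theta}_{2j} - \theta_{2j}) \\ &+ \sum_{i=1}^{d} \sum_{j=1}^{d} \frac{\partial^2 D_\phi(\theta_1, \theta_2)}{\partial \theta_{1i}\, \partial \theta_{2j}} (\hat{\theta}_{1i} - \theta_{1i})(\hat{\theta}_{2j} - \theta_{2j}) \\ &+ o\left(\|\hat{\theta}_1 - \theta_1\|^2\right) + o\left(\|\hat{\theta}_2 - \theta_2\|^2\right) \end{aligned} \tag{54}$$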

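Under the assumption of the null hypothesis,

$$\frac{\partial D_\phi(\theta_1, \theta_2)}{\partial \theta_{1i}} = \int \phi'\left(\frac{f_{\theta_1}(x)}{f_{\theta_2}(x)}\right) \frac{\partial f_{\theta_1}(x)}{\partial \theta_{1i}}\, d\mu(x) = \phi'(1) \int \frac{\partial f_{\theta_1}(x)}{\partial \theta_{1i}}\, d\mu(x) = \phi'(1) \frac{\partial}{\partial \theta_{1i}} \int f_{\theta_1}(x)\, d\mu(x) = 0 \tag{55}$$

This follows from Assumption 1 and from the assumption that derivatives and integrals can be interchanged.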

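$$\frac{\partial^2 D_\phi(\theta_1, \theta_2)}{\partial \theta_{1i}\, \partial \theta_{1j}} = \int \frac{\partial f_{\theta_1}(x)}{\partial \theta_{1i}}\, \phi''\left(\frac{f_{\theta_1}(x)}{f_{\theta_2}(x)}\right) \frac{\partial f_{\theta_1}(x)}{\partial \theta_{1j}} \frac{1}{f_{\theta_2}(x)}\, d\mu(x) = \phi''(1) \int \frac{\partial f_{\theta_1}(x)}{\partial \theta_{1i}} \frac{\partial f_{\theta_1}(x)}{\partial \theta_{1j}} \frac{1}{f_{\theta_1}(x)}\, d\mu(x) = \int \frac{\partial f_{\theta_1}(x)}{\partial \theta_{1i}} \frac{\partial f_{\theta_1}(x)}{\partial \theta_{1j}} \frac{1}{f_{\theta_1}(x)}\, d\mu(x)$$

$$\frac{\partial^2 D_\phi(\theta_1, \theta_2)}{\partial \theta_{2i}\, \partial \theta_{2j}} = \int \phi''\left(\frac{f_{\theta_2}(x)}{f_{\theta_1}(x)}\right) \frac{\partial f_{\theta_2}(x)}{\partial \theta_{2i}} \frac{\partial f_{\theta_2}(x)}{\partial \theta_{2j}} \frac{1}{f_{\theta_1}(x)}\, d\mu(x) = \phi''(1) \int \frac{\partial f_{\theta_2}(x)}{\partial \theta_{2i}} \frac{\partial f_{\theta_2}(x)}{\partial \theta_{2j}} \frac{1}{f_{\theta_1}(x)}\, d\mu(x) \tag{56}$$

$$\frac{\partial^2 D_\phi(\theta_1, \theta_2)}{\partial \theta_{1i}\, \partial \theta_{2j}} = -\int \phi''\left(\frac{f_{\theta_1}(x)}{f_{\theta_2}(x)}\right) \frac{\partial f_{\theta_1}(x)}{\partial \theta_{1i}} \frac{\partial f_{\theta_2}(x)}{\partial \theta_{2j}} \frac{1}{\left(f_{\theta_2}(x)\right)^2} f_{\theta_1}(x)\, d\mu(x) = -\phi''(1) \int \frac{\partial f_{\theta_1}(x)}{\partial \theta_{1i}} \frac{\partial f_{\theta_2}(x)}{\partial \theta_{2j}} \frac{1}{f_{\theta_1}(x)}\, d\mu(x) = -\left\{ \frac{\partial^2 D_\phi(\theta_1, \theta_2)}{\partial \theta_{1i}\, \partial \theta_{1j}} \right\}$$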

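By the standard regularity assumptions (Theorem 5.2.1 of [14]),

$$\frac{\partial^2 D_\phi(\theta_1, \theta_2)}{\partial \theta_{1i}\, \partial \theta_{1j}} = \frac{\partial^2 D_\phi(\theta_1, \theta_2)}{\partial \theta_{2i}\, \partial \theta_{2j}} = I(\theta), \qquad \frac{\partial^2 D_\phi(\theta_1, \theta_2)}{\partial \theta_{1i}\, \partial \theta_{2j}} = -I(\theta) \tag{57}$$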

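Using the arguments in (55)-(57), Equation (54) reduces to

$$\frac{1}{2} (\hat{\theta}_1 - \theta_1)^\top I(\theta_1) (\hat{\theta}_1 - \theta_1) + \frac{1}{2} (\hat{\theta}_2 - \theta_1)^\top I(\theta_1) (\hat{\theta}_2 - \theta_1) - (\hat{\theta}_1 - \theta_1)^\top I(\theta_1) (\hat{\theta}_2 - \theta_2) + o\left(\|\hat{\theta}_1 - \theta_1\|^2\right) + o\left(\|\hat{\theta}_2 - \theta_2\|^2\right) \tag{58}$$

Further, since $\theta_1 = \theta_2$ under the null hypothesis, the three quadratic terms combine into a single quadratic form in $\hat{\theta}_1 - \hat{\theta}_2$:

$$\frac{2}{\phi''(1)} D_\phi(\hat{\theta}_1, \hat{\theta}_2) = (\hat{\theta}_1 - \hat{\theta}_2)^\top I(\theta_1) (\hat{\theta}_1 - \hat{\theta}_2) + o\left(\|\hat{\theta}_1 - \theta_1\|^2\right) + o\left(\|\hat{\theta}_2 - \theta_2\|^2\right) \tag{59}$$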

Assuming that a change point $\tau$ divides the data into two heterogeneous parts, with parameters $\theta_1$, $\theta_2$ before and after the change point and sample sizes $\tau$, $(n - \tau)$ respectively, then by the regularity conditions the MLEs are such that

$$\sqrt{\tau}\, (\hat{\theta}_1 - \theta_1) \to N\left(0, I(\theta_1)^{-1}\right), \qquad \sqrt{n - \tau}\, (\hat{\theta}_2 - \theta_2) \to N\left(0, I(\theta_2)^{-1}\right) \tag{60}$$

For

Let then,

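$$\sqrt{\frac{\tau(n-\tau)}{n}}\, (\hat{\theta}_1 - \theta_1) \to N\left(0, \lambda I(\theta_1)^{-1}\right), \qquad \sqrt{\frac{\tau(n-\tau)}{n}}\, (\hat{\theta}_2 - \theta_2) \to N\left(0, (1-\lambda) I(\theta_2)^{-1}\right) \tag{61}$$

By the assumption of the null hypothesis, $\theta_1 = \theta_2 = \theta_0$,

$$\sqrt{\frac{\tau(n-\tau)}{n}}\, (\hat{\theta}_2 - \hat{\theta}_1) \to N\left(0, I(\theta_0)^{-1}\right) \tag{62}$$

under the assumption that the parameter estimates are consistent.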

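Suppose that, under maximum likelihood estimation for a sample of size $n$, $\hat{\theta}_n \to \theta$ as $n \to \infty$. By the law of large numbers, the observed information matrix is such that

$$I_n(\theta) = \left[ -\frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2}{\partial \theta_i\, \partial \theta_j} \log f(x;\theta) \right] \to \left[ -E\left( \frac{\partial^2}{\partial \theta_i\, \partial \theta_j} \log f(x;\theta) \right) \right] = I(\theta) \tag{63}$$

If we substitute $\hat{\theta}_n$ for $\theta$,

$$I_n(\hat{\theta}_n) = \left[ -\frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2}{\partial \theta_i\, \partial \theta_j} \log f(x;\theta) \right]_{\theta = \hat{\theta}_n} \to \left[ -E\left( \frac{\partial^2}{\partial \theta_i\, \partial \theta_j} \log f(x;\theta) \right) \right] = I(\theta) \tag{64}$$

which is defined as a consistent estimator of the information matrix.

The elements of $I(\theta)$ are continuous in $\theta$, and it holds that

$$I_n(\theta) \to I(\theta) \quad \text{as } n \to \infty \tag{65}$$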

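From Equation (59) we obtain

$$\frac{\tau(n-\tau)}{n} (\hat{\theta}_1 - \hat{\theta}_2)^\top I(\theta_1) (\hat{\theta}_1 - \hat{\theta}_2) + o\left(\|\hat{\theta}_1 - \theta_1\|^2\right) + o\left(\|\hat{\theta}_2 - \theta_2\|^2\right) \tag{66}$$

From Equation (9) and Equations (56)-(66), the test statistic can be expressed as

$$D_n^\tau = \max_{\tau \in N(\epsilon)} \frac{\tau(n-\tau)}{n} \left\{ (\hat{\theta}_1 - \hat{\theta}_2)^\top I(\theta_1) (\hat{\theta}_1 - \hat{\theta}_2) + o\left(\|\hat{\theta}_1 - \theta_1\|^2\right) + o\left(\|\hat{\theta}_2 - \theta_2\|^2\right) \right\} \tag{67}$$

Let

$$\max_{\tau \in N(\epsilon)} \frac{\tau(n-\tau)}{n} \left\{ (\hat{\theta}_1 - \hat{\theta}_2)^\top I(\hat{\theta}_1) (\hat{\theta}_1 - \hat{\theta}_2) \right\} = W_n^\tau, \qquad \max_{\tau \in N(\epsilon)} D_n^\tau = \max_{\tau \in N(\epsilon)} W_n^\tau + o\left(\|\hat{\theta}_1 - \theta_1\|^2\right) + o\left(\|\hat{\theta}_2 - \theta_2\|^2\right) \tag{68}$$

But

$$o\left(\|\hat{\theta}_1 - \theta_1\|^2\right) = o_p(1), \qquad o\left(\|\hat{\theta}_2 - \theta_2\|^2\right) = o_p(1)$$

Since the second and third terms of (67) are $o_p(1)$, the distribution of $D_n^\tau$ is asymptotically the same as that of $W_n^\tau$.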

Cite this paper: Susan, M. , Waititu, A. , Mwita, P. and Wamwea, C. (2020) Consistency of the φ-Divergence Based Change Point Estimator. Open Journal of Statistics, 10, 832-849. doi: 10.4236/ojs.2020.105048.
References

[1]   Brodsky, E. and Darkhovsky, B.S. (2013) Nonparametric Methods in Change Point Problems. Vol. 243, Springer Science & Business Media, Berlin.

[2]   Killick, R. and Eckley, I. (2014) Change Point: An R Package for Change Point Analysis. Journal of Statistical Software, 58, 1-19.
https://doi.org/10.18637/jss.v058.i03

[3]   Korkas, K.K. and Fryzlewicz, P. (2017) Multiple Change-Point Detection for Non-Stationary Time Series Using Wild Binary Segmentation. Statistica Sinica, 27, 287-311.
https://doi.org/10.5705/ss.202015.0262

[4]   Csorgo, M. and Horváth, L. (1997) Limit Theorems in Change-Point Analysis. Vol. 18, John Wiley & Sons Inc., Hoboken.

[5]   Page, E. (1955) A Test for a Change in a Parameter Occurring at an Unknown Point. Biometrika, 42, 523-527.
https://doi.org/10.1093/biomet/42.3-4.523

[6]   Cheng, L., AghaKouchak, A., Gilleland, E. and Katz, R.W. (2014) Non-Stationary Extreme Value Analysis in a Changing Climate. Climatic Change, 127, 353-369.
https://doi.org/10.1007/s10584-014-1254-5

[7]   Jarusková, D. and Rencová, M. (2008) Analysis of Annual Maximal and Minimal Temperatures for Some European Cities by Change Point Methods. Environmetrics, 19, 221-233.
https://doi.org/10.1002/env.865

[8]   Naveau, P., Guillou, A. and Rietsch, T. (2014) A Non-Parametric Entropy-Based Approach to Detect Changes in Climate Extremes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76, 861-884.
https://doi.org/10.1111/rssb.12058

[9]   Dupuis, D., Sun, Y. and Wang, H.J. (2015) Detecting Change-Points in Extremes. Statistics and Its Interface, 8, 19-31.
https://doi.org/10.4310/SII.2015.v8.n1.a3

[10]   Dette, H. and Wu, W. (2018) Change Point Analysis in Non-Stationary Processes: A Mass Excess Approach.

[11]   Truong, C., Oudre, L. and Vayatis, N. (2018) A Review of Change Point Detection Methods.

[12]   Pardo, L. (2018) Statistical Inference Based on Divergence Measures. Chapman and Hall/CRC, London.
https://doi.org/10.1201/9781420034813

[13]   Andrews, D.W. (1993) Tests for Parameter Instability and Structural Change with Unknown Change Point. Econometrica: Journal of the Econometric Society, 61, 821-856.
https://doi.org/10.2307/2951764

[14]   Sen, P.K. and Singer, J.M. (2017) Large Sample Methods in Statistics (1994): An Introduction with Applications. CRC Press, Boca Raton.
https://doi.org/10.1201/9780203711606

[15]   Hawkins Jr., D.L. (1983) Sequential Detection Procedures for Autoregressive Processes. Dept. of Statistics, Tech. Rep., North Carolina State University, Raleigh.

[16]   Embrechts, P., Klüppelberg, C. and Mikosch, T. (2013) Modelling Extremal Events: For Insurance and Finance. Vol. 33, Springer Science & Business Media, Berlin.

 
 