Limit Distribution of the φ-Divergence Based Change Point Estimator

Show more

1. Introduction

Many real world data are made up of consecutive regimes that are separated by abrupt changes [1]. Statistical research works have shown that with time, the underlying data generating processes undergo occasional sudden changes [2]. As a result the assumption of stationarity is often too strong and more often violated. Stationarity in the strict sense, implies time-invariance of the distribution underlying the process. The overall behavior of observations can change over time due to internal systemic changes in distribution dynamics or due to external factors. Modeling time series processes using stationary methods to capture their time-evolving dependence aspects will most likely result in a crude approximation as abrupt changes fail to be accounted for [3]. Reviewed literature reveals that the use of one model may not be appropriate to model a non-stationary series and as such various parametric and non-parametric change-point estimation methods have been proposed [4] [5] [6] [7] [8]. However, they are limited in different ways and their suitability depend on the underlying assumptions. A point of interest in all aspects of life would be to detect and estimate this changes as their implication is crucial. A time series data containing change points is assumed to be piece-wise stationary implying that some characteristics of the process change abruptly at unknown points in time. Parametric tests for change point are mainly based on the likelihood ratio statistics and estimation based on the maximum likelihood method whose general results can be found in [4]. In its simplest form, change-point detection is the name given to the problem of estimating the point at which the statistical properties of a sequence of observations change [9]. Changing point problems can be classified as off-line which deals with only a fixed sample or on-line which considers new information as it observed. Off-line change point problems deal with fixed sample sizes which are first observed and then detection and estimation of change points are done. [10] introduced the change point problem within the off-line setting. Since this pioneering work, methodologies used for change point detection have been widely researched on with methods extending to techniques for higher order moments within time series data. Change point analysis methods are applicable in a wide range of fields including but not limited to climate (climate change), quality management, medicine, finance, and genetics.

For a given set of data ${x}_{1}\mathrm{,}\cdots \mathrm{,}{x}_{n}$ a change point is said to occur when there exists a time $\tau \in 1,\cdots \mathrm{,}n-1$ such that the statistical properties of ${x}_{1}\mathrm{,}\cdots \mathrm{,}{x}_{\tau}$ and ${x}_{\left(\tau +1\right)}\mathrm{,}\cdots \mathrm{,}{x}_{n}$ are different. If $\tau $ is known then the two samples only need to be compared. However, if $\tau $ is unknown then it has to be analyzed through change point analysis that entails both detection and estimation of the change point/change time. The null hypothesis of no change against the alternative that there exists a time when the distribution characteristics of the series changed is then tested. Considering a change in model parameters the problem would be stated as

$\begin{array}{l}{H}_{0}:{\theta}_{1}={\theta}_{2}=\cdots ={\theta}_{n}={\theta}_{0}\left(\text{unknown}\right)\text{\hspace{1em}}\text{versus}\\ {H}_{1}:{\theta}_{1}=\cdots ={\theta}_{\tau}\ne {\theta}_{\tau +1}=\cdots ={\theta}_{n}\end{array}$ (1)

where $\tau $ is unknown and needs to be estimated.

If $\tau <n$ then the process distribution has changed and $\tau $ is referred to as the change point. Assume that there exists $\lambda \in \left(\mathrm{0,1}\right)$ such that $\tau $ satisfies

$\tau =\lambda n$ (2)

*i.e**.*
$\lambda $ is a fraction that divides the data process at the change point and *n* is the number of observations in a given data set. Then hypothesis 1 can be restated as

$\begin{array}{l}{H}_{0}:\tau =n,\text{\hspace{1em}}\left(\lambda =1\right)\\ {H}_{1}:\tau <n,\text{\hspace{1em}}\left(0<\lambda <1\right)\end{array}$ (3)

At a given level of significance, if the null hypothesis is rejected, then the process X is said to be locally piecewise-stationary and can be approximated by a sequence of stationary processes that may share certain features such as the general functional form of the distribution. With the assumption that change time is unknown, [4] gives eight limiting conditions that yields the null distribution of the likelihood ratio test statistic as the supremum of a standardized Brownian bridge. [11] applied these results within a non-parametric framework and obtained similar results. [12] apply the likelihood ratio test within a parametric framework on assumption that data are drawn from extreme value distributions. Through assumption of the Von Misses condition, their test statistic weakly converges in distribution to the supremum of a squared standardized Brownian bridge.

The rest of this paper is organized as follows: Section 2 gives an overview of the change point estimator based on a the $\varphi $ -divergence. Section 3 provides key results for the limit distribution of the divergence based change point estimator, Section 4 gives some simulation results and finally Section 5 gives the conclusion.

2. Single Change Point Detection and Estimation

The change point problem is addressed by using a ‘distance’ function between distributions. Given a distance function, a test statistic is constructed to guarantee a distance (≥0) between any two distributions based on a sample size *n*. Consider a given parametric model
${f}_{\theta}\mathrm{:}\theta \in \Theta $ where
$\Theta $ is the parameter space defined on a data set of size n. Let
${X}_{1}\mathrm{,}\cdots \mathrm{,}{X}_{n}$ be random variables and have probability densities
$f\left(x\mathrm{;}{\theta}_{1}\right)\mathrm{,}\cdots \mathrm{,}f\left(x\mathrm{;}{\theta}_{n}\right)$ with respect to *σ*-finite measure
$\mu $ with
$F\left(x\mathrm{;}\theta \right)$ generating distinct measures if
$\theta \in \Theta $

Definition 2.1 (*ϕ*-divergence). *Let*
${F}_{{\theta}_{1}}$ *and*
${F}_{{\theta}_{2}}$ *be* *two* *probability* *distributions*. *Define* *the* *ϕ-divergence* *between* *the* *two* *distributions* *as*

${D}_{\varphi}\left({F}_{{\theta}_{1}}\mathrm{,}{F}_{{\theta}_{2}}\right)={D}_{\varphi}\left({\theta}_{1}\mathrm{,}{\theta}_{2}\right)$

The broader family of f-divergences ( $\varphi $ -divergences) take the general form

$\begin{array}{c}{D}_{\varphi}\left({F}_{{\theta}_{1}}\mathrm{,}{F}_{{\theta}_{2}}\right)={D}_{\varphi}\left({\theta}_{1}\mathrm{,}{\theta}_{2}\right)={\displaystyle \int \varphi \left(\text{d}{F}_{{\theta}_{1}}/\text{d}{F}_{{\theta}_{2}}\right)\text{d}{F}_{{\theta}_{2}}}\\ ={\displaystyle {\int}_{x}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}{f}_{{\theta}_{2}}\left(x\right)\varphi \left(\frac{{f}_{{\theta}_{1}}\left(x\right)}{{f}_{{\theta}_{2}}\left(x\right)}\right)\text{d}\mu \left(x\right)\\ ={\mathbb{E}}_{{\theta}_{2}}\left[\varphi \left(\frac{{f}_{{\theta}_{1}}\left(X\right)}{{f}_{{\theta}_{2}}\left(X\right)}\right)\right],\text{\hspace{1em}}\text{\hspace{1em}}\varphi \in \Phi \end{array}$ (4)

where $\Phi $ is the class of all convex functions $\varphi \left(t\right)$, $t>0$ satisfying $\varphi \left(1\right)=0$. To avoid indeterminate expressions at any point $t=0$, the following assumptions in relation to the functions $\varphi $ involved in the general definition of $\varphi $ -divergence statistics are given in [13].

$\begin{array}{l}0\varphi \left(\frac{0}{0}\right)=0\\ 0\varphi \left(\frac{p}{0}\right)=\underset{u\to \infty}{\mathrm{lim}}\frac{\varphi \left(u\right)}{u}\end{array}$ (5)

Assumption 1. *The* *function*
$\varphi \in \Phi \mathrm{:}\left[\mathrm{0,}\infty \right)\to \left(-\infty \mathrm{,}+\infty \right)$ *is* *convex* *and* *continuous*. *The* *restriction* *on*
$\left[\mathrm{0,}\infty \right)$ *is* *finite*, *twice* *continuously* *differentiable* *with*
$\varphi \left(1\right)={\varphi}^{\prime}\left(1\right)=0$,
${\varphi}^{\u2033}\left(1\right)=1$.

Different choices of $\varphi $ result in many divergences that play important roles in statistics. ${D}_{\varphi}\left({\theta}_{1}\mathrm{,}{\theta}_{2}\right)\ne {D}_{\varphi}\left({\theta}_{2}\mathrm{,}{\theta}_{1}\right)$ hence divergence measures are not distance measures but give some difference between two probability measures hence the term “pseudo-distance”. More generally a divergence measure is a function of two probability density (or distribution) functions, which has non-negative values and takes the value zero only when the two arguments (distributions) are the same. A divergence measure grows larger as two distributions are further apart. Hence, a large divergence implies departure from the null hypothesis.

Based on the divergence 4 a change point estimator can be constructed as;

$D\left(\lambda \right)=\mathrm{arg}\underset{a<\tau <b}{\mathrm{max}}\left(\lambda \left(1-\lambda \right)\right)\frac{2}{{\varphi}^{\u2033}\left(1\right)}{D}_{\varphi}\left({\stackrel{^}{\theta}}_{1},{\stackrel{^}{\theta}}_{2}\right)$ (6)

To test for the possibility of having a change in distribution of *X* it is natural to compare the distribution function of the first
$\tau $ observations to that of the last
$\left(n-\tau \right)$ since the location of the change time is unknown. When
$\tau $ is near the boundary points, an estimation calculated on a correct large number of observations
$\left(n-\tau \right)$ is compared to an estimation from a small number of observations
$\tau $. This may result to an erratic behavior of the test statistic due to instability of the estimators of the parameters [6]. [14] provides the following result.

Theorem 1. *Suppose* *that*
$\lambda $ *maximizes* *the* *test* *statistic* *over*
$\left[\mathrm{0,1}\right]$ *then* *under* *the* *null* *hypothesis*,

$\begin{array}{l}\underset{\lambda \in \left[\epsilon \mathrm{,1}-\epsilon \right]}{{\displaystyle \mathrm{sup}}}D\left(\lambda \right)={O}_{p}\left(1\right)\text{\hspace{1em}}\forall \epsilon \\ \underset{\lambda \in \left[\mathrm{0,1}\right]}{{\displaystyle \mathrm{sup}}}D\left(\lambda \right)\stackrel{p}{\to}\infty \end{array}$ (7)

for proof of theorem 1 see [14].

If $\lambda $ is not bounded away from zero and one the test statistic does not converge in distribution. However, fixed critical values can be obtained for increasing sample sizes when $\lambda $ is bounded away from zero and one and yields significant power gains if the change point is in $\Lambda $. Let $\epsilon >0$ be small enough such that $\lambda \in \left(\epsilon \mathrm{,1}-\epsilon \right)$.

Suppose there exists constants $a\mathrm{,}b$ such that the unique maximum likelihood estimates ${\stackrel{^}{\theta}}_{1}\mathrm{,}{\stackrel{^}{\theta}}_{2}$ exist for all $a\le \tau \le b$. Then the test statistic is maximized over $\lambda $ such that

${D}_{n\tau}=\mathrm{arg}\underset{\epsilon n<\tau <\left(1-\epsilon \right)n}{\mathrm{max}}\left(\frac{\tau}{n}\left(1-\frac{\tau}{n}\right)\right)\frac{2}{{\varphi}^{\u2033}\left(1\right)}{D}_{\varphi}\left({\stackrel{^}{\theta}}_{1},{\stackrel{^}{\theta}}_{2}\right)$ (8)

where $a=\epsilon n$ and $b=\left(1-\epsilon \right)n$

The trimming parameter is usually taken to satisfy $0<\epsilon <0.5$ [15].

Let $N\left(\epsilon \right)=\left\{n\epsilon \mathrm{,}n\epsilon +\mathrm{1,}\cdots \mathrm{,}\left(1-\epsilon \right)n\right\}$ be the set of all values over which the test statistic 8 is maximized. A change time $\tau $ is estimated by the least value of $\tau $ that maximizes the test statistic 8.

$\stackrel{^}{\tau}=\mathrm{min}\left\{\tau :{D}_{n\tau}=\mathrm{arg}\underset{N\left(\epsilon \right)}{\mathrm{max}}\left(\frac{\tau}{n}\left(1-\frac{\tau}{n}\right)\right)\frac{2}{{\varphi}^{\u2033}\left(1\right)}{D}_{\varphi}\left({\stackrel{^}{\theta}}_{1},{\stackrel{^}{\theta}}_{2}\right)\right\}$ (9)

with ${\stackrel{^}{\theta}}_{1}$, ${\stackrel{^}{\theta}}_{2}$ being parameter estimates of ${\theta}_{1}$ and ${\theta}_{2}$ respectively and that they are dependent on the change point $\tau $. ${\stackrel{^}{\theta}}_{1}$ represents the parameter estimates before the change point and ${\stackrel{^}{\theta}}_{2}$ gives the parameter estimates after the change point. The difference between the two estimators ${\stackrel{^}{\theta}}_{1}$, ${\stackrel{^}{\theta}}_{2}$ give an idea of the difference between the two samples hence departure from the null hypothesis.

3. Main Result

Consider a second order Taylor expansion of $D\left({\stackrel{^}{\theta}}_{1}\mathrm{,}{\stackrel{^}{\theta}}_{2}\right)$ about the true parameter values ${\theta}_{1}\mathrm{,}{\theta}_{2}$.

For $i=\mathrm{1,}\cdots \mathrm{,}d$

$\begin{array}{c}D\left({\stackrel{^}{\theta}}_{1},{\stackrel{^}{\theta}}_{2}\right)=D\left({\theta}_{1},{\theta}_{2}\right)+{\displaystyle \underset{i=1}{\overset{d}{\sum}}}\frac{\partial {D}_{\varphi}\left({\theta}_{1},{\theta}_{2}\right)}{\partial {\theta}_{1i}}\left({\stackrel{^}{\theta}}_{1i}-{\theta}_{1i}\right)+{\displaystyle \underset{i=1}{\overset{d}{\sum}}}\frac{\partial {D}_{\varphi}\left({\theta}_{1},{\theta}_{2}\right)}{\partial {\theta}_{2i}}\left({\stackrel{^}{\theta}}_{2i}-{\theta}_{2i}\right)\\ \text{\hspace{0.17em}}+\frac{1}{2}{\displaystyle \underset{i=1}{\overset{d}{\sum}}}{\displaystyle \underset{j=1}{\overset{d}{\sum}}}\frac{{\partial}^{2}{D}_{\varphi}\left({\theta}_{1},{\theta}_{2}\right)}{\partial {\theta}_{1i}\partial {\theta}_{1j}}{\left({\stackrel{^}{\theta}}_{1i}-{\theta}_{1i}\right)}^{\prime}\left({\stackrel{^}{\theta}}_{1j}-{\theta}_{1j}\right)\\ \text{\hspace{0.17em}}+\frac{1}{2}{\displaystyle \underset{i=1}{\overset{d}{\sum}}}{\displaystyle \underset{j=1}{\overset{d}{\sum}}}\frac{{\partial}^{2}{D}_{\varphi}\left({\theta}_{1},{\theta}_{2}\right)}{\partial {\theta}_{2i}\partial {\theta}_{2j}}{\left({\stackrel{^}{\theta}}_{2i}-{\theta}_{2i}\right)}^{\prime}\left({\stackrel{^}{\theta}}_{2j}-{\theta}_{2j}\right)\\ \text{\hspace{0.17em}}+{\displaystyle \underset{i=1}{\overset{d}{\sum}}}{\displaystyle \underset{j=1}{\overset{d}{\sum}}}\frac{{\partial}^{2}{D}_{\varphi}\left({\theta}_{1},{\theta}_{2}\right)}{\partial {\theta}_{1i}\partial {\theta}_{2j}}{\left({\stackrel{^}{\theta}}_{1i}-{\theta}_{1i}\right)}^{\prime}\left({\stackrel{^}{\theta}}_{2j}-{\theta}_{2j}\right)\\ \text{\hspace{0.17em}}+o\left({\Vert {\stackrel{^}{\theta}}_{1}-{\theta}_{1}\Vert}^{2}\right)+o\left({\Vert {\stackrel{^}{\theta}}_{2}-{\theta}_{2}\Vert}^{2}\right)\end{array}$ (10)

$\begin{array}{l}D\left({\theta}_{1},{\theta}_{2}\right)=0\\ \frac{\partial {D}_{\varphi}\left({\theta}_{1},{\theta}_{2}\right)}{\partial {\theta}_{1i}}={\displaystyle \int}\text{\hspace{0.05em}}\text{\hspace{0.05em}}{\varphi}^{\prime}\left(\frac{{f}_{{\theta}_{1}}\left(x\right)}{{f}_{{\theta}_{2}}\left(x\right)}\right)\frac{\partial {f}_{{\theta}_{1}}\left(x\right)}{\partial {\theta}_{1i}}\text{d}\mu \left(x\right)\\ \text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}={\displaystyle \int}\text{\hspace{0.05em}}\text{\hspace{0.05em}}{\varphi}^{\prime}\left(1\right)\frac{\partial {f}_{{\theta}_{1}}\left(x\right)}{\partial {\theta}_{1i}}\text{d}\mu \left(x\right)\\ \text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.05em}}={\varphi}^{\prime}\left(1\right){\displaystyle \int}\frac{\partial {f}_{{\theta}_{1}}\left(x\right)}{\partial {\theta}_{1i}}\text{d}\mu \left(x\right)\\ \text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}={\varphi}^{\prime}\left(1\right)\frac{\partial}{\partial {\theta}_{1i}}{\displaystyle \int}\text{\hspace{0.05em}}\text{\hspace{0.05em}}{f}_{{\theta}_{1}}\left(x\right)\text{d}\mu \left(x\right)=0\end{array}$ (11)

This is by assumption 1 and that ${f}_{{\theta}_{1}}\left(x\right)$ is a pdf.

$\begin{array}{c}\frac{\partial {D}_{\varphi}\left({\theta}_{1},{\theta}_{2}\right)}{\partial {\theta}_{2i}}={\displaystyle \int}\text{\hspace{0.05em}}\text{\hspace{0.05em}}\varphi \left(\frac{{f}_{{\theta}_{1}}\left(x\right)}{{f}_{{\theta}_{2}}\left(x\right)}\right)\frac{\partial {f}_{{\theta}_{2}}\left(x\right)}{\partial {\theta}_{2i}}-{f}_{{\theta}_{2}}\left(x\right){\varphi}^{\prime}\left(\frac{{f}_{{\theta}_{1}}\left(x\right)}{{f}_{{\theta}_{2}}\left(x\right)}\right)\left(\frac{{f}_{{\theta}_{1}}\left(x\right)}{{f}_{{\theta}_{2}}{\left(x\right)}^{2}}\right)\\ ={\displaystyle \int}\text{\hspace{0.05em}}\text{\hspace{0.05em}}\varphi \left(1\right)\frac{\partial {f}_{{\theta}_{2}}\left(x\right)}{\partial {\theta}_{2i}}-{\varphi}^{\prime}\left(1\right)=0\end{array}$ (12)

$\begin{array}{l}\frac{{\partial}^{2}{D}_{\varphi}\left({\theta}_{1}{\theta}_{2}\right)}{\partial {\theta}_{1i}{\theta}_{1j}}={\displaystyle \int}\frac{\partial {f}_{{\theta}_{1}}\left(x\right)}{\partial {\theta}_{1i}}{\varphi}^{\u2033}\left(\frac{{f}_{{\theta}_{1}}\left(x\right)}{{f}_{{\theta}_{2}}\left(x\right)}\right)\frac{\partial {f}_{{\theta}_{1}}\left(x\right)}{\partial {\theta}_{1j}}\frac{1}{{f}_{{\theta}_{2}}\left(x\right)}\text{d}\mu \left(x\right)\\ \text{since}\text{\hspace{0.17em}}\text{\hspace{0.05em}}{\theta}_{1}={\theta}_{2}\\ \text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}={\varphi}^{\u2033}\left(1\right){\displaystyle \int}\frac{\partial {f}_{{\theta}_{1}}\left(x\right)}{\partial {\theta}_{1i}}\frac{\partial {f}_{{\theta}_{1}}\left(x\right)}{\partial {\theta}_{1j}}\frac{1}{{f}_{{\theta}_{1}}\left(x\right)}\text{d}\mu \left(x\right)\mathrm{,}\text{\hspace{1em}}\text{by}\text{\hspace{0.17em}}\text{assumption}\text{\hspace{0.17em}}\text{1}\\ \text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}={\displaystyle \int}\frac{\partial {f}_{{\theta}_{1}}\left(x\right)}{\partial {\theta}_{1i}}\frac{\partial {f}_{{\theta}_{1}}\left(x\right)}{\partial {\theta}_{1j}}\frac{1}{{f}_{{\theta}_{1}}\left(x\right)}\text{d}\mu (\; x\; )\end{array}$

$\begin{array}{c}\frac{{\partial}^{2}{D}_{\varphi}\left({\theta}_{1}{\theta}_{2}\right)}{\partial {\theta}_{2i}{\theta}_{2j}}={\displaystyle \int}\text{\hspace{0.05em}}\text{\hspace{0.05em}}{\varphi}^{\u2033}\left(\frac{{f}_{{\theta}_{2}}\left(x\right)}{{f}_{{\theta}_{1}}\left(x\right)}\right)\frac{\partial {f}_{{\theta}_{2}}\left(x\right)}{\partial {\theta}_{2i}}\frac{\partial {f}_{{\theta}_{2}}\left(x\right)}{\partial {\theta}_{2j}}\frac{1}{{f}_{{\theta}_{1}}\left(x\right)}\text{d}\mu \left(x\right)\\ ={\varphi}^{\u2033}\left(1\right){\displaystyle \int}\frac{\partial {f}_{{\theta}_{2}}\left(x\right)}{\partial {\theta}_{2i}}\frac{\partial {f}_{{\theta}_{2}}\left(x\right)}{\partial {\theta}_{2j}}\frac{1}{{f}_{{\theta}_{1}}\left(x\right)}\text{d}\mu \left(x\right)\\ ={\displaystyle \int}\frac{\partial {f}_{{\theta}_{2}}\left(x\right)}{\partial {\theta}_{2i}}\frac{\partial {f}_{{\theta}_{2}}\left(x\right)}{\partial {\theta}_{2j}}\frac{1}{{f}_{{\theta}_{1}}\left(x\right)}\text{d}\mu \left(x\right)\mathrm{,}\text{\hspace{1em}}\text{by}\text{\hspace{0.17em}}\text{assumption}\text{\hspace{0.17em}}\text{1}\end{array}$ (13)

$\begin{array}{c}\frac{{\partial}^{2}{D}_{\varphi}\left({\theta}_{1}{\theta}_{2}\right)}{\partial {\theta}_{1i}{\theta}_{2j}}={\displaystyle \int}\text{\hspace{0.05em}}\text{\hspace{0.05em}}{\varphi}^{\u2033}\left(\frac{{f}_{{\theta}_{1}}\left(x\right)}{{f}_{{\theta}_{2}}\left(x\right)}\right)\frac{\partial {f}_{{\theta}_{1}}\left(x\right)}{\partial {\theta}_{1j}}\frac{\partial {f}_{{\theta}_{2}}\left(x\right)}{\partial {\theta}_{2j}}-\frac{1}{{\left({f}_{{\theta}_{2}}\left(x\right)\right)}^{2}}{f}_{{\theta}_{1}}\left(x\right)\text{d}\mu \left(x\right)\\ =-{\varphi}^{\u2033}\left(1\right){\displaystyle \int}\frac{\partial {f}_{{\theta}_{1}}\left(x\right)}{\partial {\theta}_{1j}}\frac{\partial {f}_{{\theta}_{2}}\left(x\right)}{\partial {\theta}_{2j}}\frac{1}{{f}_{{\theta}_{1}}\left(x\right)}\text{d}\mu \left(x\right)\\ =-\left\{\frac{{\partial}^{2}{D}_{\theta}\left({\theta}_{1}{\theta}_{2}\right)}{\partial {\theta}_{1i}\partial {\theta}_{1j}}\right\}\end{array}$ (14)

By definition of the Fisher information matrix,

$\begin{array}{l}\frac{{\partial}^{2}{D}_{\varphi}\left({\theta}_{1},{\theta}_{2}\right)}{\partial {\theta}_{1i}{\theta}_{1j}}=\frac{{\partial}^{2}{D}_{\varphi}\left({\theta}_{1},{\theta}_{2}\right)}{\partial {\theta}_{2i}{\theta}_{2j}}=I\left(\theta \right)\\ \frac{{\partial}^{2}{D}_{\varphi}\left({\theta}_{1},{\theta}_{2}\right)}{\partial {\theta}_{1i}{\theta}_{2j}}=-I\left(\theta \right)\end{array}$ (15)

Equation (10) reduces to

$\begin{array}{l}\frac{1}{2}{\left({\stackrel{^}{\theta}}_{1}-{\theta}_{1}\right)}^{\prime}I\left({\theta}_{0}\right)\left({\stackrel{^}{\theta}}_{1}-{\theta}_{1}\right)+\frac{1}{2}{\left({\stackrel{^}{\theta}}_{2}-{\theta}_{2}\right)}^{\prime}I\left({\theta}_{0}\right)\left({\stackrel{^}{\theta}}_{2}-{\theta}_{2}\right)\\ -{\left({\stackrel{^}{\theta}}_{1}-{\theta}_{1}\right)}^{\prime}I\left({\theta}_{0}\right)\left({\stackrel{^}{\theta}}_{2}-{\theta}_{2}\right)+o\left({\Vert {\stackrel{^}{\theta}}_{1}-{\theta}_{1}\Vert}^{2}\right)+o\left({\Vert {\stackrel{^}{\theta}}_{2}-{\theta}_{2}\Vert}^{2}\right)\end{array}$ (16)

Further,

$\begin{array}{c}\frac{2}{{\varphi}^{\u2033}\left(1\right)}{D}_{\varphi}\left({\stackrel{^}{\theta}}_{1},{\stackrel{^}{\theta}}_{2}\right)={\left({\stackrel{^}{\theta}}_{1}-{\theta}_{1}\right)}^{\prime}I\left({\theta}_{0}\right)\left({\stackrel{^}{\theta}}_{1}-{\theta}_{1}\right)+{\left({\stackrel{^}{\theta}}_{2}-{\theta}_{2}\right)}^{\prime}I\left({\theta}_{0}\right)\left({\stackrel{^}{\theta}}_{2}-{\theta}_{2}\right)\\ \text{\hspace{0.17em}}-2{\left({\stackrel{^}{\theta}}_{1}-{\theta}_{1}\right)}^{\prime}I\left({\theta}_{0}\right)\left({\stackrel{^}{\theta}}_{2}-{\theta}_{2}\right)+o\left({\Vert {\stackrel{^}{\theta}}_{1}-{\theta}_{1}\Vert}^{2}\right)+o\left({\Vert {\stackrel{^}{\theta}}_{2}-{\theta}_{2}\Vert}^{2}\right)\\ ={\left({\stackrel{^}{\theta}}_{1}-{\stackrel{^}{\theta}}_{2}\right)}^{\prime}I\left({\theta}_{0}\right)\left({\stackrel{^}{\theta}}_{1}-{\stackrel{^}{\theta}}_{2}\right)+o\left({\Vert {\stackrel{^}{\theta}}_{1}-{\theta}_{1}\Vert}^{2}\right)+o\left({\Vert {\stackrel{^}{\theta}}_{2}-{\theta}_{2}\Vert}^{2}\right)\end{array}$ (17)

From Equation (17) we obtain

$\frac{\tau \left(n-\tau \right)}{n}\left\{{\left({\stackrel{^}{\theta}}_{1}-{\stackrel{^}{\theta}}_{2}\right)}^{\prime}\stackrel{^}{I\left({\theta}_{0}\right)}\left({\stackrel{^}{\theta}}_{1}-{\stackrel{^}{\theta}}_{2}\right)+o\left({\Vert {\stackrel{^}{\theta}}_{1}-{\theta}_{1}\Vert}^{2}\right)+o\left({\Vert {\stackrel{^}{\theta}}_{2}-{\theta}_{2}\Vert}^{2}\right)\right\}$ (18)

From 8 and Equations (10)-(18) then the test statistic can be expressed as

${D}_{n\tau}=\underset{\tau \in N\left(\epsilon \right)}{\mathrm{max}}\frac{\tau \left(n-\tau \right)}{n}\left\{{\left({\stackrel{^}{\theta}}_{1}-{\stackrel{^}{\theta}}_{2}\right)}^{\prime}\stackrel{^}{I\left({\theta}_{0}\right)}\left({\stackrel{^}{\theta}}_{1}-{\stackrel{^}{\theta}}_{2}\right)+o\left({\Vert {\stackrel{^}{\theta}}_{1}-{\theta}_{1}\Vert}^{2}\right)+o\left({\Vert {\stackrel{^}{\theta}}_{2}-{\theta}_{2}\Vert}^{2}\right)\right\}$ (19)

Let,

$\underset{\tau \in N\left(\epsilon \right)}{\mathrm{max}}\frac{\tau \left(n-\tau \right)}{n}{\left({\stackrel{^}{\theta}}_{1}-{\stackrel{^}{\theta}}_{2}\right)}^{\prime}\stackrel{^}{I\left({\theta}_{0}\right)}\left({\stackrel{^}{\theta}}_{1}-{\stackrel{^}{\theta}}_{2}\right)={W}_{n\tau}$

Then,

$\underset{\tau \in N\left(\epsilon \right)}{\mathrm{max}}{D}_{n\tau}=\underset{\tau \in N\left(\epsilon \right)}{\mathrm{max}}{W}_{n\tau}+o\left({\Vert {\stackrel{^}{\theta}}_{1}-{\theta}_{1}\Vert}^{2}\right)+o\left({\Vert {\stackrel{^}{\theta}}_{2}-{\theta}_{2}\Vert}^{2}\right)$ (20)

Consider the second and third terms on the RHS.

$\begin{array}{l}{\Vert {\stackrel{^}{\theta}}_{1}-{\theta}_{1}\Vert}^{2}={\left({\stackrel{^}{\theta}}_{1}-{\theta}_{1}\right)}^{\prime}\left({\stackrel{^}{\theta}}_{1}-{\theta}_{1}\right)\\ {\Vert {\stackrel{^}{\theta}}_{2}-{\theta}_{2}\Vert}^{2}={\left({\stackrel{^}{\theta}}_{2}-{\theta}_{2}\right)}^{\prime}\left({\stackrel{^}{\theta}}_{2}-{\theta}_{2}\right)\end{array}$ (21)

for $n\to \infty $

$o\left({\Vert {\stackrel{^}{\theta}}_{1}-{\theta}_{1}\Vert}^{2}\right)={o}_{p}\left(1\right),\text{\hspace{1em}}o\left({\Vert {\stackrel{^}{\theta}}_{2}-{\theta}_{2}\Vert}^{2}\right)={o}_{p}\left(1\right)$ (22)

The distribution of ${D}_{n\tau}$ is similar to that of ${W}_{n\tau}$ since the second and third terms of 20 are ${o}_{p}\left(1\right)$.

The change point estimator is reduced to a trimmed maximal Wald type test statistic. Consider the following conditions [16] [17].

(C1) Regularity: Interchanges of derivative and integral operations be valid so that,

$\mathbb{E}\left\{\frac{\partial}{\partial \theta}\mathrm{log}f\left(x\mathrm{;}\theta \right)\right\}=0$ (23)

$\mathbb{E}\left\{\frac{{\partial}^{2}}{\partial {\theta}^{2}}\mathrm{log}f\left(x\mathrm{;}\theta \right)\right\}=-I\left(\theta \right)$ (24)

(C2) for $i\mathrm{,}j=\mathrm{1,}\cdots \mathrm{,}d$

$\frac{\partial}{\partial {\theta}_{i}}f\left(x\mathrm{;}\theta \right)\mathrm{,}\text{\hspace{1em}}\frac{{\partial}^{2}}{\partial {\theta}_{i}\partial {\theta}_{j}}f\left(x\mathrm{;}\theta \right)$

exist almost everywhere such that

$\left|\frac{\partial}{\partial {\theta}_{i}}f\left(x\mathrm{;}\theta \right)\right|\le {H}_{i}\left(x\right)\mathrm{,}\text{\hspace{1em}}\left|\frac{{\partial}^{2}}{\partial {\theta}_{i}\partial {\theta}_{j}}f\left(x\mathrm{;}\theta \right)\right|\le {G}_{ij}(\; x\; )$

where

${\int}_{{\mathbb{R}}^{d}}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}{H}_{i}\left(x\right)<\infty \mathrm{,}\text{\hspace{1em}}{\displaystyle {\int}_{{\mathbb{R}}^{d}}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}{G}_{ij}\left(x\right)<\infty $

*i.e**.* the first and second partial derivatives of
$f\left(x\mathrm{;}\theta \right)$ with respect to
$\theta $ are bounded by functions with finite integrals.

(C3) for $i\mathrm{,}j=\mathrm{1,}\cdots \mathrm{,}d$

$\frac{\partial}{\partial {\theta}_{i}}\mathrm{log}f\left(x\mathrm{;}\theta \right)\mathrm{,}\text{\hspace{1em}}\frac{{\partial}^{2}}{\partial {\theta}_{i}\partial {\theta}_{j}}\mathrm{log}f\left(x\mathrm{;}\theta \right)$

exist almost everywhere and are such that the information matrix $I\left(\theta \right)$ exists and is positive definite throughout $\Theta $ and is continuous in $\theta $. Hence

${I}^{-1}\left(\theta \right)={I}^{-1/2}{\left(\theta \right)}^{\prime}{I}^{-1/2}(\; \theta \; )$

for some non-singular matrix ${I}^{-1/2}(\; \theta \; )$

A positive definite matrix is always non-singular (determinant ≠ 0) and the determinant is always positive implying that the matrix is invertible *i.e.*
${I}^{-1}\left(\theta \right)=\Sigma $ (variance-covariance matrix).

(C4) There are constants $a\mathrm{,}b$ such that we can find unique ${\stackrel{^}{\theta}}_{1}\mathrm{,}{\stackrel{^}{\theta}}_{2}$ for each $a\le \tau \le b$.

(C5) There is an open subset ${\Theta}_{0}\in \Theta \subseteq \mathbb{R}$ containing ${\theta}_{0}$ such that ${g}_{i}\left(x\mathrm{;}\theta \right)\mathrm{,}{g}_{ij}\left(x\mathrm{;}\theta \right)\mathrm{,1}\le i\mathrm{,}j\mathrm{,}k\le d$ exist and are continuous in $\theta $ for all $x\in {\mathbb{R}}^{d}$ and $\theta \in {\Theta}_{0}$.

(C6) as $\delta \to 0$

$\mathbb{E}\left\{\underset{h:\left|h\right|\le \delta}{{\displaystyle \mathrm{sup}}}\left|\frac{{\partial}^{2}}{\partial {\theta}^{2}}\mathrm{log}f\left(x;\theta +h\right)-\frac{{\partial}^{2}}{\partial {\theta}^{2}}\mathrm{log}f\left(x;\theta \right)\right|\right\}={\upsilon}_{\delta}\to 0$ (25)

Theorem 2. *Under* *the* *null* *hypothesis* *and* *that* *conditions* *C1-C6* *hold* *then* *the* *asymptotic* *distribution* *of* *the* *test* *statistic* *is* *given* *by*,

${W}_{n\tau}\stackrel{D}{\to}{S}_{p\mathrm{,}\epsilon}$ (26)

as $n\to \infty $, $p=\mathrm{dim}\theta $. where

${S}_{p,\epsilon}=\underset{t\in \left[\epsilon ,1-\epsilon \right]}{{\displaystyle \mathrm{sup}}}{B}_{p}^{2}(\; t\; )$

for ${B}_{p}\left(t\right)=\frac{1}{{t}^{\left(1/2\right)}{\left(1-t\right)}^{\left(1/2\right)}}{\left\{{\left({W}_{p}\left(t\right)\right)}^{\prime}\left({W}_{p}\left(t\right)\right)\right\}}^{\left(1/2\right)}$ such that ${W}_{p}\left(t\right)$ is a p-dimensional Brownian bridge process.

The following results hold for approximation of the distribution function using the inverted Laplace transformation,

$\begin{array}{l}P\left(\underset{\epsilon <t<1-\epsilon}{{\displaystyle \mathrm{sup}}}{B}_{d}^{2}\left(t\right)\ge u\right)\\ =\frac{1}{\Gamma \left(d/2\right)}{\left(\frac{u}{2}\right)}^{d/2}{\text{e}}^{-u/2}\left(\mathrm{log}\left(\frac{{\left(1-\epsilon \right)}^{2}}{{\epsilon}^{2}}\right)\left(1-\frac{d}{u}\right)+\frac{2}{u}+O\left({u}^{-2}\right)\right)\end{array}$ (27)

for proof of this see [18]. Rather than considering a fixed trimming value (
$\epsilon $ ) for all sample sizes, the approach of [4] [11] is followed such that the trimming parameter is s function of the sample size *n*.

Critical Values

At any given level of significance, the asymptotic critical values of the test can then be estimated. Depending on the dimension of the parameter space (d), the critical values can be estimated. For a bi-variate parameter space *i.e.* *d* = 2,
$\alpha \in \left(\mathrm{10\%,5\%,1\%}\right)$, the asymptotic critical values are presented in Table 1 for different sample sizes such that
$n\in \left\{\mathrm{50,200,500,1000}\right\}$.

Table 1. Asymptotic critical values.

4. Simulation Study

In this section change point estimation for the generalized Pareto distribution is considered. For any given finite set of data, at least one of the following is likely at any given change point *τ*: *ξ* changes by a non-zero quantity; *σ* changes by a non-zero quantity; both *ξ* and *σ* change by non-zero quantities. A simple change point problem can be formulated in the following way;

$\begin{array}{l}{H}_{0}:{X}_{t}~GP\left({\xi}_{1},{\sigma}_{1}\right):\text{for}\text{\hspace{0.17em}}t=1,\cdots ,n\\ {H}_{1}:{X}_{t}~GP\left({\xi}_{1},{\sigma}_{1}\right):\text{for}\text{\hspace{0.17em}}t\le \tau \\ \text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}{X}_{t}~GP\left({\xi}_{2},{\sigma}_{2}\right):\text{for}\text{\hspace{0.17em}}t>\tau \end{array}$ (28)

where ${\xi}_{1}\ne {\xi}_{2},{\sigma}_{1}\ne {\sigma}_{2}$

Assumption: This work assumes that both parameters of the GP distribution (shape and scale) change at the same time.

Let $\varphi \left(t\right)=-\mathrm{log}\left(t\right)$. The resulting divergence is the Kullback Leibler (KL) divergence. The KL divergence between two GP distributions is given as

${D}_{\left\{KL\right\}}\left({\stackrel{^}{\theta}}_{2};{\stackrel{^}{\theta}}_{1}\right)=\mathrm{log}\left(\frac{{\stackrel{^}{\sigma}}_{1}}{{\stackrel{^}{\sigma}}_{2}}\right)-\left(1+{\stackrel{^}{\xi}}_{2}\right)+\left(\frac{1}{{\stackrel{^}{\sigma}}_{1}}+\frac{{\stackrel{^}{\xi}}_{1}}{{\stackrel{^}{\sigma}}_{1}}\right){{\displaystyle \int}}^{\text{}}{\left(1+\frac{{\stackrel{^}{\xi}}_{2}}{{\stackrel{^}{\sigma}}_{2}}x\right)}^{-\frac{1}{{\stackrel{^}{\xi}}_{2}}}{\left(1+\frac{{\stackrel{^}{\xi}}_{1}}{{\stackrel{^}{\sigma}}_{1}}x\right)}^{-1}$ (29)

The KL divergence is a function of the parameters of two densities (before and after the change point).

The change point estimator thus becomes

${D}_{\tau ,n}={\mathrm{max}}_{\u03f5<\tau <\left(1-\u03f5\right)n}\left(\frac{\tau}{n}\left(1-\frac{\tau}{n}\right)\right)\frac{2}{{\varphi}^{\u2033}\left(1\right)}{D}_{KL}\left({\stackrel{^}{\theta}}_{1};{\stackrel{^}{\theta}}_{2}\right)$ (30)

For the simulation study, the following model is considered under the alternative hypothesis;

$\begin{array}{l}{H}_{1}\mathrm{:}{X}_{t}~GP\left(\mathrm{1,0.1}\right)\text{\hspace{1em}}\text{for}\text{\hspace{0.17em}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}t\le \tau \\ \text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.05em}}\text{\hspace{0.05em}}{X}_{t}~GP\left(\mathrm{3,0.35}\right)\text{\hspace{1em}}\text{for}\text{\hspace{0.17em}}\text{\hspace{0.05em}}t>\tau \end{array}$ (31)

The change point
$\tau $ is fixed at *n*/2 for
$n=100,200,500$ and 1000. For a 5% level of significance the change point is estimated and the results are presented in Figures 1-3. The change point process
${D}_{n,\tau}$ takes a hill shape with the peak being observed at the point where the detected change point lies. To estimate the change time, the estimated critical value is superimposed on the graph and the change points taken as the maximum value exceeding the critical bound (critical value at level of significance). The estimated time of change is also superimposed on the respective plots of the time series data.

Figure 1 shows the evolution of the change point test statistic on the left panel and the time series data on the right panel. The graph of hypothesis testing gives
${D}_{\left\{n\tau \right\}}=24.86981$.
${D}_{\left\{n\tau \right\}}>{C}_{0.05}=12.835$. *H _{0}* is therefore rejected and it is concluded that a change point exists in the time series data. The largest divergence is estimated at
$\stackrel{^}{\tau}=88$ against the actual true change point
$\stackrel{^}{\tau}=100$. Figure 2 of hypothesis testing has
${D}_{\left\{n\tau \right\}}=18.525111$.
${D}_{\left\{n\tau \right\}}>{C}_{0.05}=13.3777$.

Figure 1. Change point test process (left panel) and time series data (right panel) for n = 200 and *τ* = n/2.

Figure 2. Change point test process (left panel) and time series data (right panel) for n = 500 and *τ* = n/2.

Figure 3. Change point test process (left panel) and time series data (right panel) for n = 1000 and *τ* = n/2.

The change point estimator is examined when the true change point is no longer located in the middle of the sample size but is fixed towards the boundary points at *n*/3 for *n* = 200, 500, 1000. The results are shown in Figures 4-6.

Figure 4 shows the evolution of the change point test statistic on the left panel and the time series data on the right panel. The graph of hypothesis testing gives
${D}_{n\tau}=13.9438>{C}_{0.05}=12.835$. *H*_{0} is therefore rejected and it is concluded that a change point exists at the respective level of significance. The largest divergence is estimated at
$\stackrel{^}{\tau}=70$ against the actual true change point
$\tau =66$. Figure 5 of hypothesis testing has
${D}_{n\tau}=16.7829>{C}_{0.05}$. *H*_{0} is therefore rejected and a change point is declared. The largest divergence is estimated at
$\stackrel{^}{\tau}=164$ against the actual true change point of
$\tau =166$. Figure 6 has
${D}_{n\tau}=14.6752$ which rejects *H*_{0} at 5% level of significance. The estimated change point is at
$\stackrel{^}{\tau}=330$ against the true change point
$\tau =333$.

The asymptotic power of the change point test is examined. The most commonly used criteria for checking the optimality of a statistical test involves fixing the false alarm probability (type I error) and maximizing the detection probability (minimizing the type II error). The power of the test at a given level against a particular alternative is defined as the probability of rejecting the null hypothesis when the alternative is actually true.

$p\left(\alpha \right)=P\left({D}_{n\tau}>{C}_{\alpha}|{H}_{1}\right)$ (32)

For power or sample-size computation, not only the distribution of the test statistic under the null hypothesis needs to be obtained but also its distribution under the alternative hypothesis. This is beyond the scope of this work hence reliance on simulation results. ${D}_{n\tau}$ is estimated for 1000 replicates of simulated data using defined sample size $n=500$. The behavior of the test as the change point approaches the data boundary points is analyzed using the power function such that.

Figure 4. Change point test process (left panel) and time series data (right panel) for n = 200 and *τ* = n/3.

Figure 5. Change point test process (left panel) and time series data (right panel) for n = 500 and *τ* = n/3.

Figure 6. Change point test process (left panel) and time series data (right panel) for n = 1000 and *τ* = n/3.

$\stackrel{^}{p}\left(\alpha \right)=\left(1+\#\left({D}_{n\tau}>{C}_{\alpha}\right)\right)/\left(1+N\right)$ (33)

The results in Table 2 indicate that the change point test is most powerful when the change point is located at the middle of the data set and less powerful

Figure 7. The 95% power function for $n=500$.

Table 2. Change point power estimates of the test with *n* = 500.

when the change point is located towards the boundary points. This behavior is attributed to comparing an estimate computed from a small sample (say the first $\tau $ observations) to one computed from a larger sample (say the last $n-\tau $ observations). Small sample sizes result to erratic behavior of the test statistic due to instability of the parameter estimates. This implies that the test is more likely to incorrectly reject a change point when is it located towards the boundary points of any given data set (This is shown in Figure 7 that is the power function graph).

5. Conclusion

In this work, a divergence based (pseudo-distance) estimator has been used for estimating change in the parameters of any given parametric distribution. The change-point estimator is the first point at which there is maximal sample evidence of a change characterized by maximum divergence exceeding a critical bound. By application of the likelihood standard regularity conditions the distribution of the pseudo-metric based estimator is found to converge to that of a Brownian bridge process on a given interval. The distribution of the pseudo-distance based change point process is found to be similar to that of a maximal trimmed Wald-type test statistic under the null hypothesis of no change point. The distribution does not depend on the choice of the function $\varphi $ and this is therefore applicable within a parametric framework when using other choices of the function $\varphi $ for other statistical divergences. Further work can be done on the theoretical power properties of this particular change point estimator.

Acknowledgements

The first author thanks the Pan-African University Institute of Basic Sciences, Technology and Innovation (PAUSTI) for funding this research.

References

[1] Truong, C., Oudre, L., & Vayatis, N. (2018). A Review of Change Point Detection Methods. arXiv preprint arXiv:1801.00718.

[2] Brodsky, E. and Darkhovsky, B.S. (2013) Nonparametric Methods in Change Point Problems. Vol. 243, Springer Science & Business Media, Berlin.

[3] Korkas, K.K. and Fryzlewicz, P. (2017) Multiple Change Point Detection for Non-Stationary Time Series Using Wild Binary Segmentation. Statistica Sinica, 27, 287-311.

https://doi.org/10.5705/ss.202015.0262

[4] Csorgo, M. and Horvárth, L. (1997) Limit Theorems in Change-Point Analysis. Vol. 18, John Wiley & Sons, Hoboken.

[5] Cheng, L., AghaKouchak, A., Gilleland, E. and Katz, R.W. (2014) Non-Stationary Extreme Value Analysis in a Changing Climate. Climate Change, 127, 353-369.

https://doi.org/10.1007/s10584-014-1254-5

[6] Jarusková, D. and Rencová, M. (2008) Analysis of Annual Maximal and Minimal Temperatures for Some European Cities by Change Point Methods. Environmentrics, 19, 221-223.

https://doi.org/10.1002/env.865

[7] Dupuis, D., Sun, Y. and Wang, H.J. (2015) Detecting Change Point in Extremes. Statistics and Its Interface, 8, 19-31.

https://doi.org/10.4310/SII.2015.v8.n1.a3

[8] Dette, H. and Wu, W. (2018) Change Point Analysis in Non-Stationary Processes—A Mass Excess Approach. arXiv: 1801.09874.

[9] Killick, R. and Ekcley, I (2014) An R Package for Change Point Analysis. Journal of Statistical Software, 58, 1-19.

https://doi.org/10.18637/jss.v058.i03

[10] Page, E. (1955) A Test for a Change in a Parameter Occurring at an Unknown Point. Biometrika, 42, 523-527.

https://doi.org/10.1093/biomet/42.3-4.523

[11] Gichuh, A.W. (2008) Nonparametric Change Point Analysis for Bernoulli Random Variables Based on Neural Networks.

[12] Dierckx, G. and Teugels, J.L. (2010) Change Point Analysis of Extremes. Environmentrics, 21, 661-686.

https://doi.org/10.1002/env.1041

[13] Pardo, L. (2018) Statistical Inference Based on Divergence Measures. Chapman and Hall/CRC, New York.

https://doi.org/10.1201/9781420034813

[14] Andrews, D.W. (1993) Tests for Parameter Instability and Structural Change with Unknown Change Point. Econonmentrica, 61, 821-856.

https://doi.org/10.2307/2951764

[15] Horváth, L., Miller, C. and Rice, G. (2019) A New Class of Change Point Test Statistics of Rényi Type. Journal of Business & Economic Statistics, 38, 570-579.

https://doi.org/10.1080/07350015.2018.1537923

[16] Hawkins Jr., D.L. (1983) Sequential Detection Procedures for Autoregressive Processes. Technical Report, North Carolina State University, Raleigh.

[17] Batsidis, A., Martin, N., Pardo, L. and Zogfaros, K. (2016) φ-Divergence Based Procedure for Parametric Change Point Problems. Methodology and Computing in Applied Probability, 18, 21-35.

https://doi.org/10.1007/s11009-014-9398-3

[18] Estrella, A. (2003) Critical Values and P Values of Bessel Processes Distributions: Computation and Application to Structural Break Tests. Econometric Theory, 19, 1128-1143.

https://doi.org/10.1017/S0266466603196107