Variable Selection in Finite Mixture of Time-Varying Regression Models

1. Introduction

When a response variable $y$ with a finite mixture distribution depends on covariates $x$, we obtain a finite mixture of regression (FMR) model. The problem of variable selection in FMR models has been widely discussed [1] [2] [3]. The FMR model with K components can be written as follows [3]:

$f\left(y;x,\theta\right)=\sum_{k=1}^{K}\pi_{k}f\left(y;\eta_{k}\left(x\right),\varphi_{k}\right)$ (1)

where $y$ is an independent and identically distributed (IID) response and $x$ is a $p\times 1$ vector of covariates. $\pi=\left(\pi_{1},\cdots,\pi_{K}\right)^{\text{T}}$ denotes the mixing proportions, satisfying $0<\pi_{k}<1$ and $\sum_{k=1}^{K}\pi_{k}=1$. $f\left(y;\eta_{k}\left(x\right),\varphi_{k}\right)$ is the $k$th mixture component density, where $\eta_{k}\left(x\right)=h\left(\alpha_{k}+x^{\text{T}}\beta_{k}\right)$ for $k=1,\cdots,K$, $h(\cdot)$ is a given link function, and $\varphi_{k}$ is a dispersion parameter.
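As a concrete sketch, Equation (1) with Gaussian component densities and an identity link can be evaluated numerically as below; the function name and array layout are our own illustration, not from the paper.

```python
import numpy as np
from scipy.stats import norm

def fmr_density(y, x, pi, alpha, beta, sigma):
    """Evaluate Equation (1) for a Gaussian mixture of regressions:
    f(y; x, theta) = sum_k pi_k * N(y; alpha_k + x^T beta_k, sigma_k^2).
    pi, alpha, sigma: (K,) arrays; beta: (K, p); identity link assumed."""
    etas = alpha + beta @ x              # component means eta_k(x)
    return float(np.sum(pi * norm.pdf(y, loc=etas, scale=sigma)))

# Example: when both components coincide, the two-component mixture
# collapses to a single standard normal density at y = 0.
pi = np.array([0.5, 0.5])
density = fmr_density(0.0, np.zeros(3), pi, np.zeros(2), np.zeros((2, 3)), np.ones(2))
```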

However, in some situations, observations are not independent. As pointed out in [2], in the analysis of the PD data, observations from each patient over time were assumed to be independent to facilitate the analysis and comparison with results from the literature; however, the validity of such an assumption may be questionable. We therefore consider the situation where the observations form a time series.

The generalized autoregressive conditional heteroskedasticity (GARCH) model is widely used in time series analysis. A mixture generalized autoregressive conditional heteroscedastic (MGARCH) model was proposed in [4], and [5] generalized the MixN-GARCH model by relaxing the assumption of constant mixing weights. We therefore combine the GARCH model with the FMR model to address the above problem.

Variable selection methods have been studied extensively. A recent review of the literature on the variable selection problem in FMR models can be found in [6]. A general family of penalty functions, including the least absolute shrinkage and selection operator (LASSO), the minimax concave penalty (MCP) and the smoothly clipped absolute deviation (SCAD), is discussed in [2] and [7].

The maximum penalized log-likelihood (MPL) estimate is usually computed with the EM algorithm. [8] proposed a new algorithm (block-wise MM) for the MPL estimation of the L-MLR model. It was proved to have desirable features such as coordinate-wise updates of parameters, monotonicity of the penalized likelihood sequence, and global convergence of the estimates to a stationary point of the penalized log-likelihood function, which are missing in the commonly used approximate-EM algorithm presented in [3].

The rest of the paper is organized as follows: in Section 2, we define the finite mixture of time-varying regression models, and in Section 3 we discuss feature selection methods. In Section 4, the block-wise MM algorithm for their estimation and the BIC for choosing tuning parameters and components are presented, and the example of the Gaussian distribution is derived. Simulation studies on the performance of the new variable selection methods are provided in Section 5. In Section 6, analysis of a real data set illustrates the use of the procedure. Finally, conclusions are given in Section 7.

2. Finite Mixture of Time-Varying Regression Models

2.1. Finite Mixture of Autoregression Models

Let $\left\{y_{t};t=1,\cdots,n\right\}$ be a time-series response, and let $\left\{x_{t};t=1,\cdots,n\right\}$ be a $p$-dimensional vector of covariates, each component of which is a time series. For an FM-AR(d) model with K components, the conditional density function for observation $t$ is given as follows:

$f\left(y_{t};x_{t},\theta\right)=\sum_{k=1}^{K}\pi_{k}f\left(y_{t};\eta_{k}\left(x_{t}\right),\varphi_{k}\right),$ (2)

where

$\eta_{k}\left(x_{t}\right)=h\left(\alpha_{k}+x_{t}^{\text{T}}\beta_{k1}+x_{t-1}^{\text{T}}\beta_{k2}+\cdots+x_{t-d+1}^{\text{T}}\beta_{kd}\right),$ (3)

for $k=1,\cdots,K$, where $h(\cdot)$ is a given link function and $\varphi_{k}$ is a dispersion parameter.

The master vector of all parameters is given by $\theta ={\left({\pi}^{\text{T}},{\alpha}^{\text{T}},{\varphi}^{\text{T}},{\beta}^{\text{T}}\right)}^{\text{T}}$, with

$\beta =\left(\begin{array}{ccc}{\beta}_{11}& \cdots & {\beta}_{1d}\\ \vdots & \ddots & \vdots \\ {\beta}_{K1}& \cdots & {\beta}_{Kd}\end{array}\right)\mathrm{,}$ (4)

where $\beta_{ki}=\left(\beta_{ki1},\cdots,\beta_{kip}\right)^{\text{T}}\in\mathbb{R}^{p}$, $i=1,\cdots,d$. Let $\tilde{x}_{t}=\left(x_{t}^{\text{T}},x_{t-1}^{\text{T}},\cdots,x_{t-d+1}^{\text{T}}\right)$ and $\tilde{\beta}_{k}=\left(\beta_{k1},\cdots,\beta_{kd}\right)^{\text{T}}$; then (3) can be rewritten as $\eta_{k}\left(x_{t}\right)=h\left(\alpha_{k}+\tilde{x}_{t}\tilde{\beta}_{k}\right)$.
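The linear predictor in (3) can be sketched in code by stacking lagged covariate blocks into $\tilde{x}_{t}$. This is a minimal illustration assuming lags $0,\cdots,d-1$, matching the $d$ coefficient blocks $\beta_{k1},\cdots,\beta_{kd}$, with an identity link by default; the function name and layout are our own.

```python
import numpy as np

def eta_k(X, t, d, alpha_k, beta_k, h=lambda z: z):
    """Linear predictor of Equation (3):
    eta_k(x_t) = h(alpha_k + x_t^T beta_{k1} + ... + x_{t-d+1}^T beta_{kd}).
    X: (n, p) covariate series, row t holding x_t^T; beta_k: (d, p)."""
    # Stack (x_t, x_{t-1}, ..., x_{t-d+1}) into the row vector x_tilde_t.
    x_tilde = X[t - d + 1:t + 1][::-1].ravel()
    return h(alpha_k + x_tilde @ beta_k.ravel())
```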

2.2. Finite Mixture of GARCH Models

Let $\left\{y_{t};t=1,\cdots,n\right\}$ be a time-series response, and let $\left\{x_{t};t=1,\cdots,n\right\}$ be a $p$-dimensional vector of covariates, each component of which is a time series. To allow the dispersion parameters to vary over time, we propose the FM-GARCH models. For an FM-GARCH(d,M,S) model with K components, the conditional density function for observation $t$ is given as follows:

$f\left(y_{t};x_{t},\theta\right)=\sum_{k=1}^{K}\pi_{k}f\left(y_{t};\eta_{k}\left(x_{t}\right),\varphi_{kt}\right),$ (5)

where $\eta_{k}\left(x_{t}\right)=h\left(\alpha_{k}+\tilde{x}_{t}\tilde{\beta}_{k}\right)$ for $k=1,\cdots,K$, $h(\cdot)$ is a given link function, and the conditional heteroscedastic (dispersion) parameter is

$\varphi_{kt}=\gamma_{0k}+\sum_{m=1}^{M}\gamma_{km}\epsilon_{k,t-m}+\sum_{s=1}^{S}\delta_{ks}\varphi_{k,t-s},$ (6)

where $\gamma_{0k}>0$, $\gamma_{km}\ge 0$, $\delta_{ks}\ge 0$, $\epsilon_{kt}=\varphi_{kt}e_{kt}$, and $e_{kt}$ is an independent and identically distributed series with mean zero and unit variance.

The master vector of all parameters is given by $\theta ={\left({\pi}^{\text{T}},{\alpha}^{\text{T}},{\gamma}_{0}^{\text{T}},{\beta}^{\text{T}},{\gamma}^{\text{T}},{\delta}^{\text{T}}\right)}^{\text{T}}$, with ${\gamma}_{0}={\left({\gamma}_{01},\cdots ,{\gamma}_{0K}\right)}^{\text{T}}$, $\gamma ={\left({\gamma}_{1},\cdots ,{\gamma}_{K}\right)}^{\text{T}}$, ${\gamma}_{k}={\left({\gamma}_{k1},{\gamma}_{k2},\cdots ,{\gamma}_{kM}\right)}^{\text{T}}$, and $\delta ={\left({\delta}_{1},\cdots ,{\delta}_{K}\right)}^{\text{T}}$, ${\delta}_{k}={\left({\delta}_{k1}\mathrm{,}{\delta}_{k2}\mathrm{,}\cdots \mathrm{,}{\delta}_{kS}\right)}^{\text{T}}$.
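The dispersion recursion (6) is a one-line update per component. The sketch below implements the equation exactly as written, with $\epsilon_{k,t-m}$ rather than its square; the function name and argument layout are our own.

```python
import numpy as np

def phi_next(gamma0, gamma, delta, eps_hist, phi_hist):
    """One step of the dispersion recursion in Equation (6):
    phi_{k,t} = gamma_{0k} + sum_m gamma_{km} * eps_{k,t-m}
                           + sum_s delta_{ks} * phi_{k,t-s}.
    eps_hist[m-1] holds eps_{k,t-m}; phi_hist[s-1] holds phi_{k,t-s}."""
    return gamma0 + float(np.dot(gamma, eps_hist)) + float(np.dot(delta, phi_hist))
```

The nonnegativity constraints on $\gamma_{0k}$, $\gamma_{km}$, and $\delta_{ks}$ stated after (6) keep the recursion positive when the $\epsilon$ history is nonnegative.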

3. Feature Selection Method

Let $\left\{\left({x}_{t}\mathrm{,}{y}_{t}\right)\mathrm{;}t=\mathrm{1,}\cdots \mathrm{,}n\right\}$ be a sample of observations from the FM-AR or FM-GARCH model. The quasi-likelihood function of the parameter $\theta $ is given by [9]

$\text{L}_{n}\left(\theta\right)=\prod_{t=1}^{n}f\left(y_{t};x_{t},\theta\right)=\prod_{t=1}^{n}\left\{\sum_{k=1}^{K}\pi_{k}f\left(y_{t};\eta_{k}\left(x_{t}\right),\varphi_{kt}\right)\right\}.$ (7)

The log quasi-likelihood function of the parameter $\theta $ is given by

$\mathcal{L}_{n}\left(\theta\right)=\sum_{t=1}^{n}\log\sum_{k=1}^{K}\pi_{k}f\left(y_{t};\eta_{k}\left(x_{t}\right),\varphi_{kt}\right).$ (8)
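For Gaussian components, the log quasi-likelihood (8) can be computed stably with a log-sum-exp over components; the array layout and names below are our own sketch.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def log_quasi_likelihood(y, eta, phi, pi):
    """Equation (8) for Gaussian components:
    L_n(theta) = sum_t log sum_k pi_k N(y_t; eta_{kt}, phi_{kt}).
    y: (n,); eta, phi: (K, n) component means/variances; pi: (K,).
    logsumexp keeps the inner sum numerically stable for small densities."""
    log_comp = norm.logpdf(y[None, :], loc=eta, scale=np.sqrt(phi))  # (K, n)
    return float(np.sum(logsumexp(np.log(pi)[:, None] + log_comp, axis=0)))
```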

When the effect of a component of $x$ is not significant, the corresponding ordinary maximum quasi-likelihood estimate is often close to 0, but not exactly 0, so the covariate is not excluded from the model. Inspired by an idea of [2], we estimate $\theta$ by maximizing the penalized log quasi-likelihood (MPLQ) function for the model

${\mathcal{F}}_{n}\left(\theta \right)={\mathcal{L}}_{n}\left(\theta \right)-{\mathcal{P}}_{n}\left(\theta \right)\mathrm{,}$ (9)

with the mixture penalty (or regularization) function:

$\mathcal{P}_{n}\left(\theta\right)=\sum_{k=1}^{K}\pi_{k}\sum_{i=1}^{d}\sum_{j=1}^{p}p_{n}\left(\beta_{kij};\lambda_{nk}\right)+\frac{1}{2}\sum_{k=1}^{K}\pi_{k}\sum_{i=1}^{d}\sum_{j=1}^{p}\upsilon_{nk}\beta_{kij}^{2},$ (10)

for some ridge tuning parameter $\upsilon_{nk}\ge 0$, where $p_{n}\left(\beta_{kij};\lambda_{nk}\right)$ is a nonnegative penalty function. In the penalty function $\mathcal{P}_{n}\left(\theta\right)$, the amount of $l_{2}$ penalty imposed on the componentwise regression coefficients $\beta_{kij}$ is chosen proportional to $\pi_{k}$. The functions $p_{n}\left(\beta_{kij};\lambda_{nk}\right)$ are designed to identify the nonsignificant coefficients $\beta_{kij}$ in the mixture components $f\left(y_{t};\eta_{k}\left(x_{t}\right),\varphi_{kt}\right)$. General regularity conditions on $p_{n}\left(\beta_{kij};\lambda_{nk}\right)$ are given in [2] [3].

We implement the new method using the following well-known penalty (or regularization) functions:

• LASSO penalty: $p_{n}\left(\beta;\lambda_{nk}\right)=\lambda_{nk}\left|\beta\right|$.

• MCP penalty: $p'_{n}\left(\beta;\lambda_{nk}\right)=\left(\lambda_{nk}-nb_{nk}\left|\beta\right|\right)_{+}$.

• SCAD penalty: $p'_{n}\left(\beta;\lambda_{nk}\right)=\lambda_{nk}I\left(n\left|\beta\right|<\lambda_{nk}\right)+\frac{\left(a_{nk}\lambda_{nk}-n\left|\beta\right|\right)_{+}}{a_{nk}-1}I\left(n\left|\beta\right|>\lambda_{nk}\right)$.

Here, $I$ is the indicator function. The constants $a_{nk}\ge 2$ and $b_{nk}\ge 0$ are as in [2], and the LASSO tuning parameter $\lambda_{nk}\ge 0$ controls the amount of penalty. The asymptotic properties of these penalty functions can be derived analogously to [3] and [2]. We call the penalty functions $\mathcal{P}_{n}\left(\theta\right)$ in (10) constructed from LASSO, MCP, and SCAD jointly with the mixed $L_{2}$-norm the MIXLASSO-ML_{2}, MIXMCP-ML_{2}, and MIXSCAD-ML_{2} penalties.
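The three penalty expressions above translate directly into code. The sketch below implements them exactly as written in the bullets (ties at $n\left|\beta\right|=\lambda_{nk}$ in the SCAD derivative are ignored, as they occur on a set of measure zero); the function names are our own.

```python
def lasso_pen(beta, lam):
    """LASSO penalty p_n(beta; lambda) = lambda * |beta|."""
    return lam * abs(beta)

def mcp_deriv(beta, lam, b, n):
    """MCP derivative p'_n(beta; lambda) = (lambda - n*b*|beta|)_+."""
    return max(lam - n * b * abs(beta), 0.0)

def scad_deriv(beta, lam, a, n):
    """SCAD derivative, as given in the bullet list (a > 2 assumed)."""
    u = n * abs(beta)
    return lam * (u < lam) + max(a * lam - u, 0.0) / (a - 1.0) * (u > lam)
```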

4. Numerical Solutions

A new method for maximizing the penalized log-likelihood function is the block-wise Minorization Maximization (MM) algorithm inspired by [8], which is also known as block successive lower-bound maximization (BSLM) algorithm in the language of [10]. At each iteration of the method, the function is maximized with respect to a single block of variables while the rest of the blocks are held fixed. We shall now proceed to describe the general framework of the algorithm.

4.1. Maximization of the Penalized Log-Likelihood Function

We follow the approach of [8] and minorize the $\epsilon$-approximation of $-\mathcal{P}_{n}\left(\theta\right)$ by

$\text{G}_{1}\left(\theta;\theta^{\left(r\right)}\right)=-\frac{1}{2}\sum_{k=1}^{K}\pi_{k}\sum_{i=1}^{d}\sum_{j=1}^{p}p_{n}\left(\frac{\beta_{kij}^{2}}{w_{kij}^{\left(r\right)}};\lambda_{nk}\right)-\frac{1}{2}\sum_{k=1}^{K}\pi_{k}\sum_{i=1}^{d}\sum_{j=1}^{p}\upsilon_{nk}\beta_{kij}^{2}+C_{1}\left(\theta^{\left(r\right)}\right),$ (11)

where $w_{kij}^{\left(r\right)}=\sqrt{\beta_{kij}^{2\left(r\right)}+\epsilon^{2}}$, for some $\epsilon>0$, and

$C_{1}\left(\theta^{\left(r\right)}\right)=-\frac{\epsilon^{2}}{2}\sum_{k=1}^{K}\pi_{k}\sum_{i=1}^{d}\sum_{j=1}^{p}p_{n}^{-1}\left(w_{kij}^{\left(r\right)};\lambda_{nk}\right)-\frac{1}{2}\sum_{k=1}^{K}\pi_{k}\sum_{i=1}^{d}\sum_{j=1}^{p}p_{n}\left(w_{kij}^{\left(r\right)};\lambda_{nk}\right).$ (12)

Moreover, minorize the log quasi-likelihood function ${\mathcal{L}}_{n}\left(\theta \right)$ by

$\begin{array}{c}\text{G}_{2}\left(\theta;\theta^{\left(r\right)}\right)=\sum_{k=1}^{K}\sum_{t=1}^{n}\tau_{kt}^{\left(r\right)}\log\pi_{k}+\sum_{k=1}^{K}\sum_{t=1}^{n}\tau_{kt}^{\left(r\right)}\log f\left(y_{t};\eta_{k}\left(x_{t}\right),\varphi_{kt}\right)\\ -\sum_{k=1}^{K}\sum_{t=1}^{n}\tau_{kt}^{\left(r\right)}\log\tau_{kt}^{\left(r\right)},\end{array}$ (13)

where $\tau_{kt}^{\left(r\right)}=\pi_{k}^{\left(r\right)}f\left(y_{t};\eta_{k}^{\left(r\right)}\left(x_{t}\right),\varphi_{kt}^{\left(r\right)}\right)/f\left(y_{t};x_{t},\theta^{\left(r\right)}\right)$.

Note that ${\tau}_{kt}^{\left(r\right)}$ and ${\text{G}}_{2}\left(\theta \mathrm{;}{\theta}^{\left(r\right)}\right)$ are analogous to the posterior probability and the expected complete-data log-likelihood function of the expectation-maximization algorithm respectively.

The block-wise MM algorithm maximizes ${\mathcal{F}}_{n}\left(\theta \right)$ iteratively in the following two steps:

• Block-wise Minorization-step. Conditioned on the rth iterate ${\theta}^{\left(r\right)}$, the FM-GARCH model can be block-wise minorized in the coordinates of the parameter components $\pi $, $\alpha $, ${\gamma}_{0}$, $\gamma $, $\delta $, and $\beta $, via the minorizers

$\text{G}_{\pi}\left(\pi;\theta^{\left(r\right)}\right)=\text{G}_{2}\left(\pi,\alpha^{\left(r\right)},\gamma_{0}^{\left(r\right)},\beta^{\left(r\right)},\gamma^{\left(r\right)},\delta^{\left(r\right)};\theta^{\left(r\right)}\right)-\mathcal{P}_{n}\left(\pi,\beta^{\left(r\right)}\right),$ (14)

${\text{G}}_{\alpha ,{\gamma}_{0}}\left(\alpha ,{\gamma}_{0};{\theta}^{\left(r\right)}\right)={\text{G}}_{2}\left({\pi}^{\left(r\right)},\alpha ,{\gamma}_{0},{\beta}^{\left(r\right)},{\gamma}^{\left(r\right)},{\delta}^{\left(r\right)};{\theta}^{\left(r\right)}\right)-{\mathcal{P}}_{n}\left({\theta}^{\left(r\right)}\right),$ (15)

${\text{G}}_{\gamma ,\delta}\left(\gamma ,\delta ;{\theta}^{\left(r\right)}\right)={\text{G}}_{2}\left({\pi}^{\left(r\right)},{\alpha}^{\left(r\right)},{\gamma}_{0}^{\left(r\right)},{\beta}^{\left(r\right)},\gamma ,\delta ;{\theta}^{\left(r\right)}\right)-{\mathcal{P}}_{n}\left({\theta}^{\left(r\right)}\right),$ (16)

${\text{G}}_{\beta}\left(\beta ;{\theta}^{\left(r\right)}\right)={\text{G}}_{1}\left({\pi}^{\left(r\right)},\beta ;{\theta}^{\left(r\right)}\right)+{\text{G}}_{2}\left({\pi}^{\left(r\right)},{\alpha}^{\left(r\right)},{\gamma}_{0}^{\left(r\right)},\beta ,{\gamma}^{\left(r\right)},{\delta}^{\left(r\right)};{\theta}^{\left(r\right)}\right),$ (17)

respectively. Similar block-wise minorizers can be constructed for the FM-AR model.

• Block-wise Maximization-step. Upon finding the appropriate set of block-wise minorizers of ${\mathcal{F}}_{n}\left(\theta \right)$, we can maximize (14) to compute the $\left(r+1\right)$ th iterate block-wise update of $\pi $. Solving for the appropriate root of the FOC (first-order condition) for the Lagrangian, we can compute the $\left(r+1\right)$ th iterate block-wise update

$\pi_{k}^{\left(r+1\right)}=\frac{\sum_{t=1}^{n}\tau_{kt}^{\left(r\right)}}{\zeta^{*}+z_{k}},$ (18)

for each $k$, where $z_{k}=\sum_{i=1}^{d}\sum_{j=1}^{p}p_{n}\left(\beta_{kij};\lambda_{nk}\right)+\frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{p}\upsilon_{nk}\beta_{kij}^{2}$, and $\zeta^{*}$ is the unique root of

$\sum_{k=1}^{K}\frac{\sum_{t=1}^{n}\tau_{kt}^{\left(r\right)}}{\zeta+z_{k}}=1$ (19)

in the interval $\left[n-\max_{k}z_{k},n-\min_{k}z_{k}\right]$; the root exists and is unique since the left-hand side of (19) is strictly decreasing in $\zeta$. Equation (19) follows from substituting (18) into the constraint $\sum_{k=1}^{K}\pi_{k}^{\left(r+1\right)}=1$.

The block-wise updates for $\alpha$, $\gamma_{0}$, $\gamma$, $\delta$, and $\beta$ can be obtained by solving (15)-(17) with the corresponding first-order conditions set equal to 0.
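The posterior weights $\tau_{kt}^{(r)}$ and the $\pi$ update (18) are straightforward to vectorize. In the sketch below, the component densities are assumed to be precomputed, and $z_{k}$ and $\zeta^{*}$ are passed in rather than solved for; names and layout are our own.

```python
import numpy as np

def responsibilities(pi, comp_dens):
    """Posterior weights tau_{kt} = pi_k f_k(y_t) / sum_j pi_j f_j(y_t).
    comp_dens: (K, n) component densities f(y_t; eta_k(x_t), phi_{kt})."""
    weighted = pi[:, None] * comp_dens
    return weighted / weighted.sum(axis=0, keepdims=True)

def pi_update(tau, z, zeta_star):
    """Equation (18): pi_k^{(r+1)} = (sum_t tau_{kt}) / (zeta* + z_k).
    With no penalty (z_k = 0) and zeta* = n this reduces to the usual
    EM proportion update."""
    return tau.sum(axis=1) / (zeta_star + z)
```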

We now present an example of the Gaussian FM-GARCH model to illustrate the procedure described above, and give the following Lemma 1 about a useful minorizer for the MPL estimation of the Gaussian FM-GARCH model, which can be found in [11].

Lemma 1. If , then the function satisfies

(20)

Example 1 We consider the Gaussian FM-GARCH Model,

(21)

where, and.

Here, , and is an independent and identically distributed series with mean zero and variance unity.

According to [8], and using Lemma 1, we can obtain the further minorizer of Gaussian FM-GARCH by

(22)

where, and

The block-wise update of $\pi$ for the Gaussian FM-GARCH model comes from (18), and the block-wise updates for the remaining blocks can be obtained from (15)-(16) with the first-order conditions set equal to 0. By doing so, we obtain the coordinate-wise updates for the $\left(\alpha,\gamma_{0}\right)$ block

(23)

(24)

for each $k$. Moreover, the coordinate-wise updates for the $\gamma$ and $\delta$ block are

(25)

(26)

for each $k$, $m$, and $s$. Finally, substituting (22) into (17), the coordinate-wise updates for the $\beta$ block are

(27)

for each $k$ and each coordinate, where the quantity involved is the first derivative of (11) with respect to the corresponding $\beta_{kij}$.

Note that (15)-(17) for the Gaussian FM-GARCH model are concave in the alternative parameterization, thus (23)-(27) globally maximize (15)-(17) over the parameter space.

4.2. Selection of Thresholding Parameters and Components

To implement the methods described in Sections 3 and 4.1, we need to select the tuning parameters $\lambda_{nk}$ and $\upsilon_{nk}$ and the constants $a_{nk}$ and $b_{nk}$, for $k=1,\cdots,K$, as well as the number of components K. The current theory provides some guidance on the order of the tuning parameters in [3] and [8], using generalized cross validation (GCV) and the Bayesian information criterion (BIC), to ensure the sparsity property. Following the example of [8], we develop a suitable BIC criterion for the FM-AR and FM-GARCH models; the parameters are chosen one at a time by minimizing

(28)

where the former is the dimensionality of the fitted model (i.e., the total number of non-zero regression coefficients) and the latter equals 3K (for FM-AR models) or 5K (for FM-GARCH models).

The block-wise MM algorithm is iterated until some convergence criterion is met. In this article, we use the absolute convergence criterion $\left|\mathcal{F}_{n}\left(\theta^{\left(r+1\right)}\right)-\mathcal{F}_{n}\left(\theta^{\left(r\right)}\right)\right|<\text{TOL}$, where TOL > 0 is a small tolerance constant, as in [8]. Based on the discussion above, we summarise our algorithm in Algorithm 1.
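The stopping rule can be sketched as a small driver loop. This is a hedged illustration: `step` stands for one full cycle of block-wise minorization and maximization updates returning the new penalized objective value, and is not from the paper.

```python
def run_until_converged(step, f_init, tol=1e-6, max_iter=500):
    """Iterate block-wise updates until the absolute criterion
    |F_n(theta^{(r+1)}) - F_n(theta^{(r)})| < tol is met.
    `step` performs one cycle of block updates and returns the new
    penalized objective value F_n(theta^{(r+1)})."""
    f_prev = f_init
    for _ in range(max_iter):
        f_new = step()
        if abs(f_new - f_prev) < tol:
            return f_new
        f_prev = f_new
    return f_prev
```

Because each block update is a minorize-maximize step, the objective sequence is monotone nondecreasing, so the absolute criterion is a natural stopping rule.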

5. Simulated Data Analysis

In this section, we evaluate the performance of the proposed method and algorithm via simulations. We consider the Gaussian FM-AR models and Gaussian FM-GARCH models. Following [2] and [8], we used the average number of correctly estimated zero coefficients (S1), the average number of correctly estimated non-zero coefficients (S2), and the mean estimate over all falsely identified non-zero predictors (M_{NZ}). The selection of thresholding parameters and components is solved using the simulated annealing (SA) algorithm. All simulations were evaluated with varying values of the dimension p, with 100 repetitions for each.

5.1. Simulated Data Analysis of Gaussian FM-AR

The first simulations are based on the Gaussian FM-AR (2) model. Assuming that K is known, the model for the simulation was a and model of

(29)

where, , , , , , and. Columns of are drawn from a multivariate normal, with mean 0, variance 1, and two correlation structures:. The regression coefficients are

Table 1 reports the results. We can see that when the dimension p = 100, the S2 in com1 from MIXSCAD-ML_{2} is 100, whereas the S2 in com1 from MIXLASSO-ML_{2} (S2 = 70.7) and MIXMCP-ML_{2} (S2 = 51.3) are smaller, which indicates that MIXSCAD-ML_{2} correctly identifies the non-zero coefficients while some non-zero coefficients in the MIXLASSO-ML_{2} and MIXMCP-ML_{2} models are not estimated. The mean estimates over all falsely identified non-zero predictors (M_{NZ}) from MIXSCAD-ML_{2} are between 0.001 and 0.01.

5.2. Simulated Data Analysis of Gaussian FM-GARCH

The second simulations are based on the Gaussian FM-GARCH(2,1,1) model. Also assuming that K is known, the model for the simulation was a, , and model of

(30)

(31)

for, where, , , and, and, and, and., is an independent and identically distributed series with mean zero and variance unity. Columns of are drawn from a multivariate normal, with mean 0, variance 1, and two correlation structures: . The regression coefficients are

Table 1. Summary of the MIXLASSO-ML_{2}, MIXMCP-ML_{2} and MIXSCAD-ML_{2}-penalized FM-AR (2) model with the BIC method from the simulated scenario. Average correctly estimated zero coefficients (specificity; S_{1}), average correctly estimated non-zero coefficients (sensitivity; S_{2}), and the mean estimate over all incorrectly estimated non-zero coefficients (M_{NZ}) are also reported.

From Table 2, we can see that in all simulations the values of S1 in com1 and com2 from MIXSCAD-ML_{2} are the largest, which indicates that MIXSCAD-ML_{2} performs better than MIXLASSO-ML_{2} and MIXMCP-ML_{2} in correctly estimating zero coefficients. The mean estimate over all falsely identified non-zero predictors (M_{NZ}) from MIXSCAD-ML_{2} is also smaller than those from MIXLASSO-ML_{2} and MIXMCP-ML_{2}.

6. Real Data Analysis

In this section, we evaluate the performance of the proposed method and algorithm via an analysis of the behavior of urban vehicular traffic in the city of São Paulo. This data set records notable occurrences of traffic in the metropolitan region of São Paulo in the period from 14 to 18 December 2009, and was acquired from http://archive.ics.uci.edu/ml/datasets.php. Observations were registered from 7:00 to 20:00 every 30 minutes. It contains 135 observations and 18

Table 2. Summary of the MIXLASSO-ML_{2}, MIXMCP-ML_{2} and MIXSCAD-ML_{2}-penalized FM-GARCH(2,1,1) model with the BIC method from the simulated scenario. Average correctly estimated zero coefficients (specificity; S_{1}), average correctly estimated non-zero coefficients (sensitivity; S_{2}), and the mean estimate over all incorrectly estimated non-zero coefficients (M_{NZ}) are also reported.

variables as well as one response variable. Covariate acronyms are hour (HO), immobilized bus (IB), broken truck (BT), vehicle excess (VE), accident victim (AV), running over (RO), fire vehicles (FV), occurrence involving freight (OIF), incident involving dangerous freight (IIDF), lack of electricity (LOE), fire (FI), point of flooding (POF), manifestations (MA), defect in the network of trolleybuses (DNT), tree on the road (TRR), semaphore off (SO), and intermittent semaphore (IS); the response is slowness in traffic percent. To account for the effect of date on traffic behavior, we add a new variable, day (DA). Figure 1 shows the heterogeneity of the data set, indicating that an FM-AR or FM-GARCH model is applicable.

The levels of the covariate attributes from the FMR, FM-AR (2) and FM-GARCH (2,1,1) models are given in Table 3. From Table 4, we can see that the MIXSCAD-ML_{2}-penalized FM-GARCH (2,1,1) model had the lowest BIC (622.9) across all analyses, with the FM-AR (2) model ranked second, both lower than the BIC (682.3) of the FMR model. The predicted slowness in traffic percent from the FM-GARCH model had an MSE of 1.93 and a regression of 0.90; the FM-AR (2) model had an MSE of 2.09 and a regression of 0.89; and the FMR model had an MSE of 2.41 and a regression of 0.87. These results suggest that the FM-GARCH (2,1,1) model had the smallest MSE and explained the largest proportion of variance for the slowness in traffic percent data. The predicted responses from these models are presented in Figure 2.

Figure 1. Density of slowness in traffic percent in the metropolitan region of São Paulo in the period from 14 to 18 December 2009.

Table 3. Summary of the FMR, FM-AR and FM-GARCH models with the BIC method and MIXLASSO-ML_{2} penalty.

Table 4. Summary of the values of BIC, MSE, and adjusted regression (predicted response on observed response) from FMR, FM-AR (2) and FM-GARCH (2,1,1) models.

Figure 2. Summary of predicted and observed slowness in traffic percent in the metropolitan region of São Paulo in the period from 14 to 18 December 2009.

7. Discussion

In this article, we discussed the modeling of a time-series response variable whose finite mixture distribution depends on covariates, together with the associated variable selection problem. We proposed the FM-AR and FM-GARCH models for modeling time-series data that arise from a heterogeneous population, and a new regularization method (MIXLASSO-ML_{2}, MIXMCP-ML_{2}, MIXSCAD-ML_{2}) for variable selection in these models, composed of a mixture penalty and a penalty proportional to the mixing proportions. In addition, we computed the maximum penalized log quasi-likelihood estimates for the new penalized FM-AR and FM-GARCH models, and derived a general expression for the block-wise minorization maximization (MM) algorithm with its desirable features. The simulation results for the Gaussian FM-AR and Gaussian FM-GARCH models and an actual data set illustrate the capability of the methodology and algorithm, with MIXSCAD-ML_{2} consistently superior to the other penalty methods.

References

[1] McLachlan, G.J. and Peel, D. (2000) Finite Mixture Models. Wiley, New York.

https://doi.org/10.1002/0471721182

[2] Abbas, K. and Shili, L. (2013) Regularization in Finite Mixture of Regression Models with Diverging Number of Parameters. Biometrics, 69, 201-235.

https://doi.org/10.1111/biom.12020

[3] Abbas, K. and Chen, J.H. (2007) Variable Selection in Finite Mixture of Regression Models. Journal of the American Statistical Association, 102, 1025-1038.

https://doi.org/10.1198/016214507000000590

[4] Zhang, Z., Li, W.K. and Yuen, K.C. (2005) On a Mixture GARCH Time-Series Model. Journal of Time Series Analysis, 27, 577-597.

https://doi.org/10.1111/j.1467-9892.2006.00467.x

[5] Haas, M., Krause, J., Paolella, M.S. and Steude, S.C. (2013) Time-Varying Mixture GARCH Models and Asymmetric Volatility. North American Journal of Economics and Finance, 26, 602-623.

https://doi.org/10.1016/j.najef.2013.02.024

[6] Abbas, K. (2011) An Overview of the New Feature Selection Methods in Finite Mixture of Regression Models. Journal of the Iranian Statistical Society, 10, 201-235.

[7] Fan, J.Q. and Li, R.Z. (2001) Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96, 1348-1360.

https://doi.org/10.1198/016214501753382273

[8] Lloyd-Jones, L.R., Nguyen, H.D. and McLachlan, G.J. (2018) A Globally Convergent Algorithm for Lasso-Penalized Mixture of Linear Regression Models. Computational Statistics and Data Analysis, 119, 19-38.

https://doi.org/10.1016/j.csda.2017.09.003

[9] Hwang, S.Y., et al. (2014) Quasilikelihood and Quasi-Maximum Likelihood for GARCH-Type Processes: Estimating Function Approach. Journal of the Korean Statistical Society, 1, 5.

https://doi.org/10.1016/j.jkss.2014.01.005

[10] Razaviyayn, M., Hong, M. and Luo, Z.-Q. (2013) A Unified Convergence Analysis of Block Successive Minimization Methods for Nonsmooth Optimization. SIAM Journal on Optimization, 23, 1126-1153.

https://doi.org/10.1137/120891009

[11] Lange, K. (2013) Optimization. Springer, New York.

https://doi.org/10.1007/978-1-4614-5838-8