Back
 APM  Vol.10 No.1 , January 2020
Adaptive Sparse Group Variable Selection for a Robust Mixture Regression Model Based on Laplace Distribution
Abstract: The traditional estimation of Gaussian mixture model is sensitive to heavy-tailed errors; thus we propose a robust mixture regression model by assuming that the error terms follow a Laplace distribution in this article. And for the variable selection problem in our new robust mixture regression model, we introduce the adaptive sparse group Lasso penalty to achieve sparsity at both the group-level and within-group-level. As numerical experiments show, compared with other alternative methods, our method has better performances in variable selection and parameter estimation. Finally, we apply our proposed method to analyze NBA salary data during the period from 2018 to 2019.

1. Introduction

The mixture regression model is a powerful tool to explain the relationships between the response variable and the covariates when the population is heterogeneous and consists of several homogeneous components, and the early research can trace back to [1]. In 1977, EM algorithm was first proposed by [2] ; it greatly simplified the solution procedure of the mixture regression model. Then the mixture regression model attracted a lot of interest from statisticians; it was widely applied in many fields, such as business, marketing and social sciences. See [3] [4] [5] for example.

Recently, the research about the mixture regression model is becoming more and more detailed. On the one hand, statisticians paid attention to improving the robustness of mixture regression model, [6] used the t-distribution for overcoming the influence of outliers and [7] introduced a robust mixture regression model by assuming the error terms follow a Laplace distribution. Further, Wu et al. [8] dropped any parametric assumption about the error densities and proposed the mixture of quantile regressions model. On the other hand, variable selection became a research hotspot in mixture regression modeling. Khalili and Chen [9] considered a class of penalization methods, including L1-norm penalty, SCAD penalty and Hard penalty. Furthermore, the adaptive Lasso was introduced as a penalty function in [10], and [11] suggested the use of the Lasso-penalized mixture regression model as a screening mechanism in a two-step procedure.

However, the above regularization methods are more focused on individual variable selection. They all ignore the grouping structures which describe the inherent interconnections among predictors. It may lead to inefficient models. In order to achieve both the robustness of mixture regression model and correct identification of group structures, we assume random errors follow a Laplace distribution and consider a situation that covariates have natural grouping structures, where those in the same group are correlated. In this case, variable selection should be conducted at both the group-level and within-group-level; thus we use the adaptive sparse group Lasso [12] as a penalty function of our proposed mixture regression model and adopt EM algorithm to estimate the mixture regression parameters. Moreover, under some mild conditions, we can prove that the maximum penalized log-likelihood estimators are both sparse and n -consistent simultaneously.

The rest of this article is organized as follows. In Section 2, we introduce the robust mixture regression model based on Laplace distribution and adopt the adaptive sparse group Lasso for variable selection. In Section 3, we prove some asymptotic properties for our proposed method. In Section 4, we solve the problem of tuning parameters and components selection. Section 5 conducts a numerical simulation to evaluate the performance of our method. In Section 6, we apply our proposed method to NBA salary data. Finally, the conclusion of this paper is given in Section 7.

2. Model Overview

2.1. Robust Mixture Regression with Laplace Distribution

Let { ( x 1 , y 1 ) , , ( x n , y n ) } be a random sample of observations from the population ( x , y ) , where x = ( x 1 , , x p ) T p is a p-dimensional covariate vector, and y is the response variable which is dependent on corresponding x . Furthermore, for g mixture components, we can say that ( x , y ) follows a mixture linear regression model based on a normal distribution if the conditional density of y given x is

f ( y | x , Ψ ) = j = 1 g π j 2 π σ j 2 exp ( ( y α j x T β j ) 2 2 σ j 2 ) , (1)

where the mixing probabilities satisfy j = 1 g π j = 1 , π j > 0 and the parameter Ψ = ( π 1 , α 1 , β 1 , σ 1 2 , , π g , α g , β g , σ g 2 ) . For the jth mixture component, there are intercept α j , regression coefficients β j = ( β j 1 , , β j p ) T and variance σ j 2 ( σ j > 0 ) .

It is known that the mixture linear regression model is sensitive to outliers or heavy-tailed error distributions, and outliers impact more heavily on the mixture linear regression model than on the usual linear regression model, since outliers not only affect the estimation of the regression parameters, but also possibly totally blur the mixture structure. In order to improve the robustness of the estimation procedure, we introduce a robust mixture regression model with a Laplace distribution

f ( y | x , Ψ ) = j = 1 g π j 2 σ j 2 exp ( 2 | y α j x T β j | σ j ) . (2)

Then we can estimate the unknown parameter Ψ by maximizing the log-likelihood function

L n ( Ψ ) = i = 1 n log j = 1 g π j 2 σ j 2 exp ( 2 | y i α j x i T β j | σ j ) . (3)

2.2. Adaptive Sparse Group Lasso for Variable Selection

Now, we consider a situation that covariates have natural grouping structures and can be divided into K groups as x = ( x [ 1 ] T , , x [ k ] T , , x [ K ] T ) T p by some known rules, where x [ k ] = ( x 1 [ k ] , , x p k [ k ] ) T is a group which contains p k variables and k = 1 K p k = p . Then, the log-likelihood function can be written as

L n ( Ψ ) = i = 1 n log j = 1 g π j 2 σ j 2 exp ( 2 | y i α j k = 1 K x i [ k ] T β j [ k ] | σ j ) , (4)

where β j [ k ] = ( β j 1 [ k ] , , β j p k [ k ] ) T .

In order to exploit the grouping structures of covariates, we apply the adaptive sparse group Lasso (adaSGL) to the robust mixture regression model, the penalized log-likelihood function

F n ( Ψ ) = L n ( Ψ ) P n ( Ψ ) . (5)

Here

P n ( Ψ ) = n j = 1 g π j t = 1 p λ j t 1 | β j t | + n j = 1 g π j k = 1 K λ j k 2 β j [ k ] , (6)

where . represents the Euclidean norm, λ j t 1 = λ j 1 ω j t and λ j k 2 = λ j 2 ξ j k . Moreover, the weights are defined based on the maximum penalized log-likelihood estimator Ψ ˜ when P n ( Ψ ) is a Lasso penalty,

ω j t = | β ˜ j t | 1 , ξ j k = β ˜ j [ k ] 1 . (7)

Next, we follow the approach of Hunter and Li [13] and consider to maximize the ε -approximate penalized log-likelihood function

F n , ε ( Ψ ) = L n ( Ψ ) P n , ε ( Ψ ) . (8)

Here

P n , ε ( Ψ ) = n j = 1 g π j t = 1 p λ j t 1 β j t 2 + ε 2 + n j = 1 g π j k = 1 K λ j k 2 t = 1 p k β j t 2 [ k ] + ε 2 (9)

for some small ε > 0 , and the weights are

ω j t = ( β ˜ j t 2 + ε 2 ) 1 / 2 , ξ j k = ( t = 1 p k β ˜ j t 2 [ k ] + ε 2 ) 1 / 2 . (10)

Following Hunter and Li [13], we can similarly show that | F n , ε ( Ψ ) F n ( Ψ ) | 0 uniformly as ε 0 , over any compact subset of the parameter space.

2.3. EM Algorithm for Robust Mixture Regression

However, the above penalized log-likelihood does not have an explicit maximizer. We introduce an EM algorithm to simplify the computation and denote Z i j as a latent Bernoulli variable such that

Z i j = ( 1, if i th observation ( x i , y i ) is from j th component ; 0, otherwise . (11)

If the complete data set T = ( X , Y , Z ) is observable, the complete log-likelihood function is

L n ( Ψ ; T ) = i = 1 n j = 1 g Z i j log [ π j 2 σ j 2 exp ( 2 | y i α j k = 1 K x i [ k ] T β j [ k ] | σ j ) ] , (12)

where X = ( x 1 , , x n ) T , Y = ( y 1 , , y n ) and Z = ( Z 11 , , Z n g ) .

According to Andrews and Mallows [14], we know that a Laplace distribution can be expressed as a mixture of a normal distribution and another distribution related to the exponential distribution. To be specific, there are latent scale variables V = ( V 1 , , V n ) such that we have the complete log-likelihood function

L n ( Ψ ; D ) = i = 1 n j = 1 g Z i j log { π j V i π σ j 2 exp [ V i 2 ( y i α j k = 1 K x i [ k ] T β j [ k ] ) 2 σ j 2 ] } + i = 1 n j = 1 g Z i j log [ 1 V i 3 exp ( 1 2 V i 2 ) ] , (13)

where D = ( T , V ) . Naturally, we can obtain the penalized complete log-likelihood F n , ε ( Ψ ; D ) = L n ( Ψ ; D ) P n , ε ( Ψ ) .

Suppose that Ψ ( r ) is a parameter estimate for the rth iteration. In E step of EM algorithm, we can get E [ F n , ε ( Ψ ; D ) | S , Ψ ( r ) ] by calculating

τ i j ( r ) = E [ Z i j | S , Ψ ( r ) ] , δ i j ( r ) = E [ V i 2 | S , Ψ ( r ) , Z i j = 1 ] , (14)

where S = ( X , Y ) . And we can show that

τ i j ( r ) = π j ( r ) σ j 1 ( r ) exp ( 2 | y i α j ( r ) k = 1 K x i [ k ] T β j [ k ] ( r ) | σ j 1 ( r ) ) j = 1 g π j ( r ) σ j 1 ( r ) exp ( 2 | y i α j ( r ) k = 1 K x i [ k ] T β j [ k ] ( r ) | σ j 1 ( r ) ) (15)

and

δ i j ( r ) = σ j ( r ) 2 | y i α j ( r ) k = 1 K x i [ k ] T β j [ k ] ( r ) | . (16)

The calculation for δ i j ( r ) follows the same argument as in Phillips [15].

In M step, we will maximize E [ F n , ε ( Ψ ; D ) | S , Ψ ( r ) ] for updating Ψ . Now, we follow the tactic of [16] and find a local quadratic approximation of ψ 2 + ε 2 ,

ψ 2 + ε 2 ψ 0 2 + ε 2 + ψ 2 ψ 0 2 2 ψ 0 2 + ε 2 (17)

in a neighborhood of ψ 0 . Then, we can replace the penalty function P n , ε ( Ψ ) in ( r + 1 ) th iteration by

P ˜ n , ε ( Ψ ; Ψ ( r ) ) = n j = 1 g π j t = 1 p λ j t 1 [ η j t ( r ) + β j t 2 β j t 2 ( r ) 2 η j t ( r ) ] + n j = 1 g π j k = 1 K λ j k 2 [ γ j k ( r ) + t = 1 p k ( β j t 2 [ k ] β j t 2 [ k ] ( r ) ) 2 γ j k ( r ) ] , (18)

where η j t ( r ) = β j t 2 ( r ) + ε 2 and γ j k ( r ) = t = 1 p k β j t 2 [ k ] ( r ) + ε 2 . Similarly, from Lange [17], we have

( y α x T ψ ) 2 1 p t = 1 p [ y α x T ψ 0 p x t ( ψ t ψ 0 t ) ] 2 (19)

in a neighborhood of ψ 0 = ( ψ 01 , , ψ 0 p ) T , where p is the dimensionality of ψ . And we apply (19) to E [ L n , ε ( Ψ ; D ) | S , Ψ ( r ) ] , there is a local approximation Q ( Ψ ; Ψ ( r ) ) of E [ F n , ε ( Ψ ; D ) | S , Ψ ( r ) ]

Q ( Ψ ; Ψ ( r ) ) = i = 1 n j = 1 g τ i j ( r ) δ i j ( r ) p σ j 2 t = 1 p [ y i α j x i T β j ( r ) p x i t ( β j t β j t ( r ) ) ] 2 n 2 j = 1 g π j ( t = 1 p λ j t 1 η j t ( r ) β j t 2 + k = 1 K λ j k 2 γ j k ( r ) t = 1 p k β j t 2 [ k ] ) + i = 1 n j = 1 g τ i j ( r ) ( log π j 1 2 log σ j 2 ) n j = 1 g π j [ t = 1 p λ j t 1 ( η j t ( r ) β j t 2 ( r ) 2 η j t ( r ) ) + k = 1 K λ j k 2 ( γ j k ( r ) t = 1 p k β j t 2 [ k ] ( r ) 2 γ j k ( r ) ) ] i = 1 n j = 1 g τ i j ( r ) ( 1 2 log π + log δ i j ( r ) + 1 2 δ i j ( r ) ) (20)

in a neighborhood of Ψ ( r ) . Note that (20) can be block-wise maximized in the coordinates of the parameter components π , α , β and σ 2 . Here, π = ( π 1 , , π g ) , α = ( α 1 , , α g ) , β = ( β 11 , , β g p ) and σ 2 = ( σ 1 2 , , σ g 2 ) .

Under the constraints that j = 1 g π j = 1 and π j > 0 , we adopt Lagrangian multiplier to update π by solving

Q ( π , α ( r ) , β ( r ) , σ 2 ( r ) ; Ψ ( r ) ) = 0 , (21)

where is the gradient operator, ζ is a positive scalar and 0 is a zero vector. Then we have the set of simultaneous equations

a j / π j b j ζ = 0 , (22)

where a j = i = 1 n τ i j ( r ) and b j = n t = 1 p λ j t 1 η j t ( r ) + n k = 1 K λ j k 2 γ j k ( r ) , for each j.

According to π j = a j / ( b j + ζ ) and j = 1 g π j = 1 , we can obtain the unique root ζ * by solving the equation

j = 1 g a j b j + ζ 1 = 0. (23)

Therefore, the ( r + 1 ) th iterate

π j ( r + 1 ) = a j b j + ζ * . (24)

Furthermore, by solving Q ( π ( r + 1 ) , α , β ( r ) , σ 2 ; Ψ ( r ) ) = 0 , we have the updates

α j ( r + 1 ) = i = 1 n τ i j ( r ) δ i j ( r ) ( y i x i T β j ( r ) ) i = 1 n τ i j ( r ) δ i j ( r ) (25)

and

σ j 2 ( r + 1 ) = 2 i = 1 n τ i j ( r ) δ i j ( r ) ( y i α j ( r + 1 ) x i T β j ( r ) ) 2 i = 1 n τ i j ( r ) . (26)

Similarly, for the parameter β j t in kth group, we obtain the updated formula

β j t ( r + 1 ) = i = 1 n 2 τ i j ( r ) δ i j ( r ) x i t ( y i α j ( r + 1 ) x i T β j ( r ) + p x i t β j t ( r ) ) n π j ( r + 1 ) σ j 2 ( r + 1 ) ( λ j t 1 / η j t ( r ) + λ j k 2 / γ j k ( r ) ) + 2 p i = 1 n τ i j ( r ) δ i j ( r ) x i t 2 (27)

by solving Q ( π ( r + 1 ) , α ( r + 1 ) , β , σ 2 ( r + 1 ) ; Ψ ( r ) ) = 0 .

Based on the above, we propose the following EM algorithm.

1) Choose an initial value Ψ ( 0 ) = ( π 1 ( 0 ) , α 1 ( 0 ) , β 1 ( 0 ) , σ 1 2 ( 0 ) , , π g ( 0 ) , α g ( 0 ) , β g ( 0 ) , σ g 2 ( 0 ) ) .

2) E-Step: at the ( r + 1 ) th iteration, calculate τ i j ( r ) and δ i j ( r ) by (15) and (16).

3) M-Step: at the ( r + 1 ) th iteration, update π j , α j , σ j 2 and β j t by (24), (25), (26) and (27).

4) Repeat E-Step and M-Step until convergence is obtained.

Note that if a perfect least absolute deviation (LAD) fit occurs in EM algorithm, i.e. y i α j ( r ) x i T β j ( r ) 0 for some i, j and r. As a result, δ i j ( r + 1 ) will become very large and numerical instability. In this article, we simply introduce a hard threshold to control the extremely small LAD residuals, δ i j ( r + 1 ) will be assigned a value of 106 when the perfect LAD fit occurs.

2.4. Convergence Analysis

The EM algorithm is iterated until some convergence criterion is met. Let tol be a small tolerance constant and M be the maximum iterations for the proposed algorithm, we believe the algorithm has converged to an ideal state when

| F n , ε ( Ψ ( r + 1 ) ) F n , ε ( Ψ ( r ) ) | < t o l , (28)

or the iterations over the maximum iterations M. See [17] for details regarding the relative merits of convergence criteria.

According to Dempster et al. [2], each iteration of the E step and M step of EM algorithm monotonically non-decreases the objective function (8), i.e. F n , ε ( Ψ ( r + 1 ) ) F n , ε ( Ψ ( r ) ) 0 , for all r 0 . Moreover, Wu [18] proved that if the EM sequence { Ψ ( r ) } coverges to some point Ψ * , Ψ * is a stationary point of (8) under some general conditions for F n , ε ( Ψ ) and E [ F n , ε ( Ψ ; D ) | S , Ψ ( r ) ] . Given the facts above, in this article, we run multiple times from different initializations Ψ ( 0 ) in order to obtain an appropriate limit point.

3. Asymptotic Properties

For the regression coefficient vector β j in jth component, we can separate it into β j = ( β 1 j T , β 2 j T ) T , where β 1 j is the set of non-zero effects and β 2 j is the set of zero effects. Naturally, we decompose the parameter Ψ = ( Ψ 1 , Ψ 2 ) such that Ψ 2 contains all zero effects, namely β 2 j , j = 1 , , g . The true parameter is denoted as Ψ 0 and the elements of Ψ 0 are denoted with a subscript, such as β 0 j t .

For the purpose of easy discussion, we define a n = max j , t { λ j t 1 : β 0 j t 0 } , a n * = max j , k { λ j k 2 : β 0 j [ k ] 0 } , b n = min j , t { λ j t 1 : β 0 j t = 0 } , b n * = min j , k { λ j k 2 : β 0 j [ k ] = 0 } . Furthermore, we let f ( z ; Ψ ) be the joint density function of z = ( x , y ) and Ω be an open parameter space. In order to prove the asymptotic properties of the proposed algorithm, some regularity conditions on the joint distribution of z are also required.

A1. The density f ( z ; Ψ ) has common support in z for all Ψ Ω and f ( z ; Ψ ) is identifiable in Ψ up to a permutation of the components of the mixture.

A2. For each Ψ Ω , the density f ( z ; Ψ ) admits third partial derivatives with respect to Ψ for almost all z .

A3. For each Ψ 0 Ω , there are functions M 1 ( z ) and M 2 ( z ) (possibly depending on Ψ 0 ) such that for Ψ in a neighborhood of N ( Ψ 0 ) ,

| f ( z ; Ψ ) Ψ u | M 1 ( z ) , | 2 f ( z ; Ψ ) Ψ u Ψ v | M 1 ( z ) , | 3 log f ( z ; Ψ ) Ψ u Ψ v Ψ w | M 2 (z)

such that M 1 ( z ) d z < and M 2 ( z ) f ( z ; Ψ ) d z < .

A4. The Fisher information matrix

I ( Ψ ) = E { [ Ψ log f ( z ; Ψ ) ] [ Ψ log f ( z ; Ψ ) ] T }

is finite and positive definite for each Ψ Ω .

Theorem 1. Let z i = ( x i , y i ) , i = 1 , , n , be a random sample from the joint density function f ( z ; Ψ ) that satisfies the regularity conditions A1-A4. Suppose that n a n p 0 and n a n * p 0 , as n 0 , then there is a local maximizer Ψ ^ n of the model (5) for which

Ψ ^ n Ψ 0 = O p { n 1 / 2 } ,

where p represents convergence in probability.

Proof. Let r n = n 1 / 2 . It suffices that for any given ε > 0 , there is a constant M ε such that

lim n P { sup μ = M ε F n ( Ψ 0 + r n μ ) < F n ( Ψ 0 ) } 1 ε . (29)

Now, there is a local maximum in { Ψ 0 + r n μ : μ M ε } with large probability, and this local maximizer Ψ ^ n satisfies Ψ ^ n Ψ 0 = O p ( r n ) . Then we let

Δ n ( μ ) = F n ( Ψ 0 + r n μ ) F n ( Ψ 0 ) = [ L n ( Ψ 0 + r n μ ) L n ( Ψ 0 ) ] [ P n ( Ψ 0 + r n μ ) P n ( Ψ 0 ) ] . (30)

Without loss of generality, we assume that the first d j coefficients of β 0 j are non-zero and the first K j groups contain all non-zero effects of β 0 j , where β 0 j is the true regression coefficient vector in the jth component of the mixture regression model. Hence, we have

P n ( Ψ 0 ) = n j = 1 g π j t = 1 d j λ j t 1 | β 0 j t | + n j = 1 g π j k = 1 K j λ j k 2 β 0 j [ k ] . (31)

Since P n ( Ψ 0 + r n μ ) is a sum of non-negative terms, removing terms corresponding to zero effects makes it smaller,

Δ n ( μ ) L n ( Ψ 0 + r n μ ) L n ( Ψ 0 ) n j = 1 g π j t = 1 d j λ j t 1 [ | β 0 j t + r n μ j t | | β 0 j t | ] n j = 1 g π j k = 1 K j λ j k 2 [ β 0 j [ k ] + r n μ j [ k ] β 0 j [ k ] ] . (32)

By Taylor’s expansion, triangular inequality and arithmetic-geometric mean inequality,

L n ( Ψ 0 + r n μ ) L n ( Ψ 0 ) = n 1 / 2 L n ( Ψ 0 ) T μ 1 2 [ μ T I ( Ψ 0 ) μ ] { 1 + o p ( 1 ) } , | n j = 1 g π j t = 1 d j λ j t 1 [ | β 0 j t + r n μ j t | | β 0 j t | ] | n a n j = 1 g d j μ , | n j = 1 g π j k = 1 K j λ j k 2 [ β 0 j [ k ] + r n μ j [ k ] β 0 j [ k ] ] | n a n * j = 1 g K j μ . (33)

Regularity conditions indicate that L n ( Ψ 0 ) = O p ( n 1 / 2 ) and I ( Ψ 0 ) is positive definite, and it is not difficult to find that the sign of Δ n ( μ ) is completely determined by 1 2 [ μ T I ( Ψ 0 ) μ ] { 1 + o p ( 1 ) } . Therefore, for any given ε > 0 , there is a sufficiently large M ε such that

lim n P { sup μ = M ε Δ n ( μ ) < 0 } > 1 ε , (34)

which implies (29), this completes the proof.

Theorem 2. Suppose that the conditions given in Theorem 1 and g is known, n a n p 0 , n a n * p 0 , n b n p and n b n * p , as n 0 . Then, for any n -consistent maximum penalized log-likelihood estimator Ψ ^ n , we have the following:

1) Sparsity: As n , P ( β ^ 2 j = 0 ) 0 , j = 1 , , g .

2) Asymptotic normality:

n { [ P n ( Ψ 01 ) n + I 1 ( Ψ 01 ) ] ( Ψ ^ 1 Ψ 01 ) + P n ( Ψ 01 ) n } d N ( 0 , I 1 ( Ψ 01 ) ) ,

where d denotes convergence in distribution and I 1 ( Ψ 01 ) is a Fisher information when all zero effects are removed.

Proof. In order to prove the sparsity of Theorem 2, we consider the partition Ψ = ( Ψ 1 , Ψ 2 ) and let ( Ψ ^ 1 , 0 ) is the maximizer of the penalized log-likelihood function F n ( Ψ 1 , 0 ) , which is regarded as a function of Ψ 1 . It suffices to show that in the neighborhood Ψ Ψ 0 = O ( n 1 / 2 ) , there is the probability P ( F n ( Ψ 1 , Ψ 2 ) F n ( Ψ ^ 1 , 0 ) < 0 ) 1 as n . Then we have

F n ( Ψ 1 , Ψ 2 ) F n ( Ψ ^ 1 , 0 ) = [ F n ( Ψ 1 , Ψ 2 ) F n ( Ψ 1 , 0 ) ] + [ F n ( Ψ 1 , 0 ) F n ( Ψ ^ 1 , 0 ) ] [ F n ( Ψ 1 , Ψ 2 ) F n ( Ψ 1 , 0 ) ] . (35)

On the other hand,

F n ( Ψ 1 , Ψ 2 ) F n ( Ψ 1 , 0 ) = [ L n ( Ψ 1 , Ψ 2 ) L n ( Ψ 1 , 0 ) ] [ P n ( Ψ 1 , Ψ 2 ) P n ( Ψ 1 , 0 ) ] . (36)

By the mean value theorem,

L n ( Ψ 1 , Ψ 2 ) L n ( Ψ 1 , 0 ) = [ L n ( Ψ 1 , ξ ) Ψ 2 ] T Ψ 2 (37)

for some ξ Ψ 2 = O ( n 1 / 2 ) . By the mean value theorem and regularity condition A3, we can get

L n ( Ψ 1 , ξ ) Ψ 2 L n ( Ψ 01 , 0 ) Ψ 2 L n ( Ψ 1 , ξ ) Ψ 2 L n ( Ψ 1 , 0 ) Ψ 2 + L n ( Ψ 1 , 0 ) Ψ 2 L n ( Ψ 01 , 0 ) Ψ 2 [ i = 1 n M 1 ( z i ) ] ξ + [ i = 1 n M 1 ( z i ) ] Ψ 1 Ψ 01 = { ξ + Ψ 1 Ψ 01 } O p ( n ) = O p ( n 1 / 2 ) . (38)

Here Ψ 01 is a subvector of Ψ 0 with all zero regression coefficients removed. Regularity conditions imply that L n ( Ψ 01 , 0 ) / Ψ 2 = O p ( n 1 / 2 ) , therefore L n ( Ψ 1 , ξ ) / Ψ 2 = O p ( n 1 / 2 ) . In this case, we have

L n ( Ψ 1 , Ψ 2 ) L n ( Ψ 1 , 0 ) = O p ( n 1 / 2 ) j = 1 g t = d j + 1 p | β j t | (39)

for large n. And for the penalized function P n ( Ψ ) ,

P n ( Ψ 1 , Ψ 2 ) P n ( Ψ 1 , 0 ) j = 1 g π j t = d j + 1 p n b n n | β j t | + j = 1 g π j k = K j + 1 K n b n * n β j [ k ] . (40)

Since n b n and n b n * as n , we have

[ L n ( Ψ 1 , Ψ 2 ) L n ( Ψ 1 , 0 ) ] [ P n ( Ψ 1 , Ψ 2 ) P n ( Ψ 1 , 0 ) ] < 0 (41)

with probability to one as n . This completes the proof of the sparsity.

For the asymptotic normality of Theorem 2, we still use the same argument as in Theorem 1 and consider F n ( Ψ 1 , 0 ) is a function of Ψ 1 , there is a -consistent local maximizer of this function, say, that satisfies

(42)

By the Taylor’s expansion,

(43)

Substituting into (42), we have

(44)

In addition, regularity conditions imply that

(45)

Finally, we can get

(46)

by the Slutsky’s theorem. This completes the proof of the asymptotic normality.

Now, we know that, as long as the conditions, , and are satisfied when, the conclusions of theorem 1 and theorem 2 are tenable. Since the estimator based on the Lasso penalty, it can be -consistent. Then, for any, we have for and for. Based on the fact, we can take which satisfies the and. Similarly, for with for and for, we also take to satisfy the and.

4. Tuning Parameters and Components Selection

In this section, we need to solve two problems. One concerns the number of components g and the other problem is the selection of the tuning parameters. Until now, there is little theoretical support for the selection of these hyper parameters. In former literatures, the cross validation [19] and the generalized cross validation [20] provided some effective guidances for these problems. Grün and Leisch [21] and Nguyen and McLachlan [22] indicated that the Bayesian information criterion (BIC) has a good performance in solving these problems. In this paper, we still use the BIC,

(47)

where the number of non-zero regression coefficients in the jth regression model.

Suppose that there is a set of parameter combinations. For each parameter combination, , we can obtain the parameter estimate by the proposed algorithm, and there is a which depends on corresponding. Finally, we set the best parameter combination for our robust mixture model, where.

5. Numerical Simulation

To quantify the performance of our proposed robust mixture regression model based on adaptive sparse group Lasso (adaSGL-RMR), we design a numerical simulation and generate sample data from the following mixture regression model

(48)

where Z is a component indicator. There are groups and each group consists of 5 covariates, covariates within the same group are correlated, whereas those in different groups are uncorrelated. In order to generate the covariates, we first generate 30 random variables, , independently from, then obtain, , from a multivariate normal distribution with mean zero and. The generation of the covariates as following:

(49)

The model parameters include, , , , , , , ,. The random error and are independent and we consider the follow four cases: 1); 2) Laplace distribution with mean 0 and variance 1; 3), t-distribution with 3 degrees of freedom; 4) a mixture normal distribution.

We use three methods for comparing. The Gaussian mixture model (GMM) based on Lasso penalty (Lasso-GMM), the GMM model based on adaptive Lasso penalty (adaL-GMM) and the adaSGL-RMR model. The fmr package of the R programming language is used to compute the parameter estimates of Lasso-GMM and adaL-GMM.

In this article, the algorithm is terminated when the change in the penalized complete log-likelihood function is less than 10−5 or meets the maximum iterations 105. Furthermore, we adopt a threshold value for in the consideration of practical situation. To be specific, will be assigned a value of 0 if, for some j and t. To evaluate the performances of variable selection and data fitting, we introduce the average number of selected non-zero variables (nvars) without intercepts, average number of selected non-zero groups (ngroups), frequency of correct identification of group sparsity structures (cgroups), false negative rate (FNR) of missing important predictors, false positive rate (FPR) of selecting unimportant predictors and average value of root mean square errors (RMSE). Here,

(50)

and

(51)

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives for each fitted model.

As shown in Table 1, in case (1) through case (4), Lasso-GMM fails to identify non-zero groups and selects too many unimportant predictors, adaL-GMM is inclined to select less unimportant predictors and achieves higher cgroups at the cost of ignoring some important predictors. As a contrast, adaSGL-RMR has a better performance in variable selection and RMSE indicates that the parameter estimates of our algorithm are closer to the true parameters of mixture regression model. Moreover, the simulation results clearly show that adaSGL-RMR still maintain its superiority in identifying the group sparsity structures when the distribution of random errors has a heavy tail. Therefore, no matter from the model complexity or the goodness of fit to the data, our proposed method is more competitive than other methods.

Table 1. Results of Lasso-GMM, adaL-GMM and adaSGL-RMR on 100 replicates,.

6. Real Data Analysis

In this section, we will analyze how NBA players’ performances of regular season affect their salaries. We gather salaries for all players in the NBA from the website https://hoopshype.com/salaries/players/2018-2019, during the period from 2018 to 2019. Performance measures for individuals are gathered from the website https://www.foxsports.com/nba/stats?season=2018 in the 2018-2019 regular season, which include scoring, rebounding, assists and defense statistics. By liminating missing data, we obtain a complete dataset, which contains salaries for 248 NBA players and 27 measures of performance.

These performance measures are divided into five groups and covariates in the same group are correlated. Score: points per game (PPG), points per 48 minutes (PTS/48), field goals made per game (FGM/G), field goal attempts per game (FGA/G), 3 point FG made per game (3FGM/G), 3 point FG attempts per game (3FGA/G), free throws made per game (FTM/G), free throw attempts per game (FTA/G), high point total in a single game (HIGH), points per shot (PPS). Rebound: offensive rebounds per game (ORPG), defensive rebounds per game (DRPG), rebounds per 48 minutes (RPG/48), offensive rebound% (OFF REB%), defensive rebound% (DEF REB%), rebound% (REB%). Assist: assists per game (APG), assists per 48 minutes (AST/48), assist% (AST%), turnovers per game (TPG), turnover% (TO%). Steal: steals per game (SPG), steals per 48 minutes (STL/48), steal% (STL%). Block: blocks per game (BPG), blocks per 48 minutes (BLK/48), block% (BLK%).

The matrix should be column standardized to have mean 0 and variance 1 for avoiding a poor fitting result. Then we use the stepAIC function from R package MASS to realize variable selection of the standard linear model via the BIC, the predicted logged salaries from the stepwise-BIC linear model shows a mean square error (MSE) of 0.60 and adjusted R2 of 0.42, these terrible results motivate us to conduct further research for this problem. The logged salaries histogram shows multi-modality from Figure 1, it is acceptable to use the mixture regression model for predicting the logged salaries.

For comparison, we run multiple analyses that include three sets of starting parameters for each of models. The predicted results from the adaSGL-RMR model (BIC = 625) have a MSE of 0.11 and adjusted R2 of 0.90. The predicted results from the adaSGL-RMR model (BIC = 598) have a MSE of 0.05 and adjusted R2 of 0.95. The predicted results from the adaSGL-RMR model (BIC = 517) have a MSE of 0.04 and adjusted R2 of 0.96. See Table 2 for more details. These results suggest that the adaSGL-RMR model has the smallest MSE and explains the largest proportion of variance for the logged salaries from the 2018/19 NBA regular season. Moreover, from Figure 2, the predicted densities show a good characterization of the multi-modality in the logged salaries for the adaSGL-RMR models, with the stepwise-BIC linear model not being able to model this.

Figure 1. Histogram and density estimate for logged salaries

Table 2. Parameter estimates for NBA salary data.

Figure 2. Summary of densities for predicted and observed logged salaries.

7. Conclusion

In this paper, we propose a robust mixture regression model based on a Laplace distribution and consider the adaptive sparse group Lasso for variable selection. Its oracle properties are proved completely in Section 3. In addition, the numerical simulation and real data application show that our method has better performance in parameter estimation and variable selection than other methods. A limitation of this study is that we only consider the mixture regression model with K no-overlapping groups and ignore the case when there are some overlaps between different groups. In our future work, we will pay more attention to this problem.

Cite this paper: Wang, J. and Ye, W. (2020) Adaptive Sparse Group Variable Selection for a Robust Mixture Regression Model Based on Laplace Distribution. Advances in Pure Mathematics, 10, 39-55. doi: 10.4236/apm.2020.101004.
References

[1]   Quandt, R.E. (1972) A New Approach to Estimating Switching Regressions. Journal of the American Statistical Association, 67, 306-310.
https://doi.org/10.1080/01621459.1972.10482378

[2]   Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society: Series B, 39, 1-22.
https://doi.org/10.1111/j.2517-6161.1977.tb01600.x

[3]   Jiang, W.X. and Tanner, M.A. (1999) Hierarchical Mixtures-of-Experts for Exponential Family Regression Models: Approximation and Maximum Likelihood Estimation. The Annals of Statistics, 27, 987-1011.
https://doi.org/10.1214/aos/1018031265

[4]   McLachlan, G.J. and Peel, D. (2000) Finite Mixture Models. Wiley, New York.
https://doi.org/10.1002/0471721182

[5]   Frühwirth-Schnatter, S. (2006) Finite Mixture and Markov Switching Models. Springer, New York.

[6]   Yao, W.X., Wei, Y. and Yu, C. (2014) Robust Mixture Regression Using the t-Distribution. Computational Statistics and Data Analysis, 71, 116-127.
https://doi.org/10.1016/j.csda.2013.07.019

[7]   Song, W.X., Yao, W.X. and Xing, Y.R. (2014) Robust Mixture Regression Model Fitting by Laplace Distribution. Computational Statistics and Data Analysis, 71, 128-137.
https://doi.org/10.1016/j.csda.2013.06.022

[8]   Wu, Q. and Yao, W.X. (2016) Mixtures of Quantile Regressions. Computational Statistics and Data Analysis, 93, 162-176.
https://doi.org/10.1016/j.csda.2014.04.014

[9]   Khalili, A. and Chen, J.H. (2007) Variable Selection in Finite Mixture of Regression Models. Journal of the American Statistical Association, 102, 1025-1038.
https://doi.org/10.1198/016214507000000590

[10]   Städler, N., Bühlmann, P. and van de Geer, S. (2010) L1-Penalization for Mixture Regression Models. TEST, 19, 209-256.
https://doi.org/10.1007/s11749-010-0197-z

[11]   Lloyd-Jones, L.R., Nguyen, H.D. and McLachlan, G.J. (2018) A Globally Convergent Algorithm for Lasso-Penalized Mixture of Linear Regression Models. Computational Statistics and Data Analysis, 119, 19-38.
https://doi.org/10.1016/j.csda.2017.09.003

[12]   Fang, K.N., Wang, X.Y., Zhang, S.W., Zhu, J.P. and Ma, S.G. (2015) Bi-Level Variable Selection via Adaptive Sparse Group Lasso. Journal of Statistical Computation and Simulation, 85, 2750-2760.
https://doi.org/10.1080/00949655.2014.938241

[13]   Hunter, D.R. and Li, R.Z. (2005) Variable Selection Using MM Algorithms. The Annals of Statistics, 33, 1617-1642.
https://doi.org/10.1214/009053605000000200

[14]   Andrews, D.F. and Mallows, C.L. (1974) Scale Mixtures of Normal Distributions. Journal of the Royal Statistical: Series B, 36, 99-102.
https://doi.org/10.1111/j.2517-6161.1974.tb00989.x

[15]   Phillips, R.F. (2002) Least Absolute Deviations Estimation via the EM Algorithm. Statistics and Computing, 12, 281-285.
https://doi.org/10.1023/A:1020759012226

[16]   Fan, J.Q. and Li, R.Z. (2001) Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96, 1348-1360.
https://doi.org/10.1198/016214501753382273

[17]   Lange, K. (2013) Optimization. Springer, New York.
https://doi.org/10.1007/978-1-4614-5838-8

[18]   Wu, C.F.J. (1983) On the Convergence Properties of the EM Algorithm. The Annals of Statistics, 11, 95-103.
https://doi.org/10.1214/aos/1176346060

[19]   Stone, M. (1974) Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society: Series B, 36, 111-133.
https://doi.org/10.1111/j.2517-6161.1974.tb00994.x

[20]   Craven, P. and Wahba, G. (1978) Smoothing Noisy Data with Spline Functions. Numerische Mathematik, 31, 377-403.
https://doi.org/10.1007/BF01404567

[21]   Grün, B. and Leisch, F. (2007) Fitting Finite Mixtures of Generalized Linear Regressions in R. Computational Statistics and Data Analysis, 51, 5247-5252.
https://doi.org/10.1016/j.csda.2006.08.014

[22]   Nguyen, H.D. and McLachlan, G.J. (2016) Laplace Mixture of Linear Experts. Computational Statistics and Data Analysis, 93, 177-191.
https://doi.org/10.1016/j.csda.2014.10.016

 
 
Top