The mixture regression model is a powerful tool to explain the relationships between the response variable and the covariates when the population is heterogeneous and consists of several homogeneous components, and the early research can trace back to . In 1977, EM algorithm was first proposed by  ; it greatly simplified the solution procedure of the mixture regression model. Then the mixture regression model attracted a lot of interest from statisticians; it was widely applied in many fields, such as business, marketing and social sciences. See    for example.
Recently, the research about the mixture regression model is becoming more and more detailed. On the one hand, statisticians paid attention to improving the robustness of mixture regression model,  used the t-distribution for overcoming the influence of outliers and  introduced a robust mixture regression model by assuming the error terms follow a Laplace distribution. Further, Wu et al.  dropped any parametric assumption about the error densities and proposed the mixture of quantile regressions model. On the other hand, variable selection became a research hotspot in mixture regression modeling. Khalili and Chen  considered a class of penalization methods, including L1-norm penalty, SCAD penalty and Hard penalty. Furthermore, the adaptive Lasso was introduced as a penalty function in , and  suggested the use of the Lasso-penalized mixture regression model as a screening mechanism in a two-step procedure.
However, the above regularization methods are more focused on individual variable selection. They all ignore the grouping structures which describe the inherent interconnections among predictors. It may lead to inefficient models. In order to achieve both the robustness of mixture regression model and correct identification of group structures, we assume random errors follow a Laplace distribution and consider a situation that covariates have natural grouping structures, where those in the same group are correlated. In this case, variable selection should be conducted at both the group-level and within-group-level; thus we use the adaptive sparse group Lasso  as a penalty function of our proposed mixture regression model and adopt EM algorithm to estimate the mixture regression parameters. Moreover, under some mild conditions, we can prove that the maximum penalized log-likelihood estimators are both sparse and -consistent simultaneously.
The rest of this article is organized as follows. In Section 2, we introduce the robust mixture regression model based on Laplace distribution and adopt the adaptive sparse group Lasso for variable selection. In Section 3, we prove some asymptotic properties for our proposed method. In Section 4, we solve the problem of tuning parameters and components selection. Section 5 conducts a numerical simulation to evaluate the performance of our method. In Section 6, we apply our proposed method to NBA salary data. Finally, the conclusion of this paper is given in Section 7.
2. Model Overview
2.1. Robust Mixture Regression with Laplace Distribution
Let be a random sample of observations from the population , where is a p-dimensional covariate vector, and is the response variable which is dependent on corresponding . Furthermore, for g mixture components, we can say that follows a mixture linear regression model based on a normal distribution if the conditional density of y given is
where the mixing probabilities satisfy , and the parameter . For the jth mixture component, there are intercept , regression coefficients and variance .
It is known that the mixture linear regression model is sensitive to outliers or heavy-tailed error distributions, and outliers impact more heavily on the mixture linear regression model than on the usual linear regression model, since outliers not only affect the estimation of the regression parameters, but also possibly totally blur the mixture structure. In order to improve the robustness of the estimation procedure, we introduce a robust mixture regression model with a Laplace distribution
Then we can estimate the unknown parameter by maximizing the log-likelihood function
2.2. Adaptive Sparse Group Lasso for Variable Selection
Now, we consider a situation that covariates have natural grouping structures and can be divided into K groups as by some known rules, where is a group which contains variables and . Then, the log-likelihood function can be written as
In order to exploit the grouping structures of covariates, we apply the adaptive sparse group Lasso (adaSGL) to the robust mixture regression model, the penalized log-likelihood function
where represents the Euclidean norm, and . Moreover, the weights are defined based on the maximum penalized log-likelihood estimator when is a Lasso penalty,
Next, we follow the approach of Hunter and Li  and consider to maximize the -approximate penalized log-likelihood function
for some small , and the weights are
Following Hunter and Li , we can similarly show that uniformly as , over any compact subset of the parameter space.
2.3. EM Algorithm for Robust Mixture Regression
However, the above penalized log-likelihood does not have an explicit maximizer. We introduce an EM algorithm to simplify the computation and denote as a latent Bernoulli variable such that
If the complete data set is observable, the complete log-likelihood function is
where , and .
According to Andrews and Mallows , we know that a Laplace distribution can be expressed as a mixture of a normal distribution and another distribution related to the exponential distribution. To be specific, there are latent scale variables such that we have the complete log-likelihood function
where . Naturally, we can obtain the penalized complete log-likelihood .
Suppose that is a parameter estimate for the rth iteration. In E step of EM algorithm, we can get by calculating
where . And we can show that
The calculation for follows the same argument as in Phillips .
In M step, we will maximize for updating . Now, we follow the tactic of  and find a local quadratic approximation of ,
in a neighborhood of . Then, we can replace the penalty function in iteration by
where and . Similarly, from Lange , we have
in a neighborhood of , where p is the dimensionality of . And we apply (19) to , there is a local approximation of
in a neighborhood of . Note that (20) can be block-wise maximized in the coordinates of the parameter components , , and . Here, , , and .
Under the constraints that and , we adopt Lagrangian multiplier to update by solving
where is the gradient operator, is a positive scalar and is a zero vector. Then we have the set of simultaneous equations
where and , for each j.
According to and , we can obtain the unique root by solving the equation
Therefore, the iterate
Furthermore, by solving , we have the updates
Similarly, for the parameter in kth group, we obtain the updated formula
by solving .
Based on the above, we propose the following EM algorithm.
1) Choose an initial value .
2) E-Step: at the iteration, calculate and by (15) and (16).
3) M-Step: at the iteration, update , , and by (24), (25), (26) and (27).
4) Repeat E-Step and M-Step until convergence is obtained.
Note that if a perfect least absolute deviation (LAD) fit occurs in EM algorithm, i.e. for some i, j and r. As a result, will become very large and numerical instability. In this article, we simply introduce a hard threshold to control the extremely small LAD residuals, will be assigned a value of 106 when the perfect LAD fit occurs.
2.4. Convergence Analysis
The EM algorithm is iterated until some convergence criterion is met. Let tol be a small tolerance constant and M be the maximum iterations for the proposed algorithm, we believe the algorithm has converged to an ideal state when
or the iterations over the maximum iterations M. See  for details regarding the relative merits of convergence criteria.
According to Dempster et al. , each iteration of the E step and M step of EM algorithm monotonically non-decreases the objective function (8), i.e. , for all . Moreover, Wu  proved that if the EM sequence coverges to some point , is a stationary point of (8) under some general conditions for and . Given the facts above, in this article, we run multiple times from different initializations in order to obtain an appropriate limit point.
3. Asymptotic Properties
For the regression coefficient vector in jth component, we can separate it into , where is the set of non-zero effects and is the set of zero effects. Naturally, we decompose the parameter such that contains all zero effects, namely , . The true parameter is denoted as and the elements of are denoted with a subscript, such as .
For the purpose of easy discussion, we define , , , . Furthermore, we let be the joint density function of and be an open parameter space. In order to prove the asymptotic properties of the proposed algorithm, some regularity conditions on the joint distribution of are also required.
A1. The density has common support in for all and is identifiable in up to a permutation of the components of the mixture.
A2. For each , the density admits third partial derivatives with respect to for almost all .
A3. For each , there are functions and (possibly depending on ) such that for in a neighborhood of ,
such that and .
A4. The Fisher information matrix
is finite and positive definite for each .
Theorem 1. Let , , be a random sample from the joint density function that satisfies the regularity conditions A1-A4. Suppose that and , as , then there is a local maximizer of the model (5) for which
where represents convergence in probability.
Proof. Let . It suffices that for any given , there is a constant such that
Now, there is a local maximum in with large probability, and this local maximizer satisfies . Then we let
Without loss of generality, we assume that the first coefficients of are non-zero and the first groups contain all non-zero effects of , where is the true regression coefficient vector in the jth component of the mixture regression model. Hence, we have
Since is a sum of non-negative terms, removing terms corresponding to zero effects makes it smaller,
By Taylor’s expansion, triangular inequality and arithmetic-geometric mean inequality,
Regularity conditions indicate that and is positive definite, and it is not difficult to find that the sign of is completely determined by . Therefore, for any given , there is a sufficiently large such that
which implies (29), this completes the proof.
Theorem 2. Suppose that the conditions given in Theorem 1 and g is known, , , and , as . Then, for any -consistent maximum penalized log-likelihood estimator , we have the following:
1) Sparsity: As , , .
2) Asymptotic normality:
where denotes convergence in distribution and is a Fisher information when all zero effects are removed.
Proof. In order to prove the sparsity of Theorem 2, we consider the partition and let is the maximizer of the penalized log-likelihood function , which is regarded as a function of . It suffices to show that in the neighborhood , there is the probability as . Then we have
On the other hand,
By the mean value theorem,
for some . By the mean value theorem and regularity condition A3, we can get
Here is a subvector of with all zero regression coefficients removed. Regularity conditions imply that , therefore . In this case, we have
for large n. And for the penalized function ,
Since and as , we have
with probability to one as . This completes the proof of the sparsity.
For the asymptotic normality of Theorem 2, we still use the same argument as in Theorem 1 and consider is a function of , there is a -consistent local maximizer of this function, say, that satisfies
By the Taylor’s expansion,
Substituting into (42), we have
In addition, regularity conditions imply that
Finally, we can get
by the Slutsky’s theorem. This completes the proof of the asymptotic normality.
Now, we know that, as long as the conditions, , and are satisfied when, the conclusions of theorem 1 and theorem 2 are tenable. Since the estimator based on the Lasso penalty, it can be -consistent. Then, for any, we have for and for. Based on the fact, we can take which satisfies the and. Similarly, for with for and for, we also take to satisfy the and.
4. Tuning Parameters and Components Selection
In this section, we need to solve two problems. One concerns the number of components g and the other problem is the selection of the tuning parameters. Until now, there is little theoretical support for the selection of these hyper parameters. In former literatures, the cross validation  and the generalized cross validation  provided some effective guidances for these problems. Grün and Leisch  and Nguyen and McLachlan  indicated that the Bayesian information criterion (BIC) has a good performance in solving these problems. In this paper, we still use the BIC,
where the number of non-zero regression coefficients in the jth regression model.
Suppose that there is a set of parameter combinations. For each parameter combination, , we can obtain the parameter estimate by the proposed algorithm, and there is a which depends on corresponding. Finally, we set the best parameter combination for our robust mixture model, where.
5. Numerical Simulation
To quantify the performance of our proposed robust mixture regression model based on adaptive sparse group Lasso (adaSGL-RMR), we design a numerical simulation and generate sample data from the following mixture regression model
where Z is a component indicator. There are groups and each group consists of 5 covariates, covariates within the same group are correlated, whereas those in different groups are uncorrelated. In order to generate the covariates, we first generate 30 random variables, , independently from, then obtain, , from a multivariate normal distribution with mean zero and. The generation of the covariates as following:
The model parameters include, , , , , , , ,. The random error and are independent and we consider the follow four cases: 1); 2) Laplace distribution with mean 0 and variance 1; 3), t-distribution with 3 degrees of freedom; 4) a mixture normal distribution.
We use three methods for comparing. The Gaussian mixture model (GMM) based on Lasso penalty (Lasso-GMM), the GMM model based on adaptive Lasso penalty (adaL-GMM) and the adaSGL-RMR model. The fmr package of the R programming language is used to compute the parameter estimates of Lasso-GMM and adaL-GMM.
In this article, the algorithm is terminated when the change in the penalized complete log-likelihood function is less than 10−5 or meets the maximum iterations 105. Furthermore, we adopt a threshold value for in the consideration of practical situation. To be specific, will be assigned a value of 0 if, for some j and t. To evaluate the performances of variable selection and data fitting, we introduce the average number of selected non-zero variables (nvars) without intercepts, average number of selected non-zero groups (ngroups), frequency of correct identification of group sparsity structures (cgroups), false negative rate (FNR) of missing important predictors, false positive rate (FPR) of selecting unimportant predictors and average value of root mean square errors (RMSE). Here,
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives for each fitted model.
As shown in Table 1, in case (1) through case (4), Lasso-GMM fails to identify non-zero groups and selects too many unimportant predictors, adaL-GMM is inclined to select less unimportant predictors and achieves higher cgroups at the cost of ignoring some important predictors. As a contrast, adaSGL-RMR has a better performance in variable selection and RMSE indicates that the parameter estimates of our algorithm are closer to the true parameters of mixture regression model. Moreover, the simulation results clearly show that adaSGL-RMR still maintain its superiority in identifying the group sparsity structures when the distribution of random errors has a heavy tail. Therefore, no matter from the model complexity or the goodness of fit to the data, our proposed method is more competitive than other methods.
Table 1. Results of Lasso-GMM, adaL-GMM and adaSGL-RMR on 100 replicates,.
6. Real Data Analysis
In this section, we will analyze how NBA players’ performances of regular season affect their salaries. We gather salaries for all players in the NBA from the website https://hoopshype.com/salaries/players/2018-2019, during the period from 2018 to 2019. Performance measures for individuals are gathered from the website https://www.foxsports.com/nba/stats?season=2018 in the 2018-2019 regular season, which include scoring, rebounding, assists and defense statistics. By liminating missing data, we obtain a complete dataset, which contains salaries for 248 NBA players and 27 measures of performance.
These performance measures are divided into five groups and covariates in the same group are correlated. Score: points per game (PPG), points per 48 minutes (PTS/48), field goals made per game (FGM/G), field goal attempts per game (FGA/G), 3 point FG made per game (3FGM/G), 3 point FG attempts per game (3FGA/G), free throws made per game (FTM/G), free throw attempts per game (FTA/G), high point total in a single game (HIGH), points per shot (PPS). Rebound: offensive rebounds per game (ORPG), defensive rebounds per game (DRPG), rebounds per 48 minutes (RPG/48), offensive rebound% (OFF REB%), defensive rebound% (DEF REB%), rebound% (REB%). Assist: assists per game (APG), assists per 48 minutes (AST/48), assist% (AST%), turnovers per game (TPG), turnover% (TO%). Steal: steals per game (SPG), steals per 48 minutes (STL/48), steal% (STL%). Block: blocks per game (BPG), blocks per 48 minutes (BLK/48), block% (BLK%).
The matrix should be column standardized to have mean 0 and variance 1 for avoiding a poor fitting result. Then we use the stepAIC function from R package MASS to realize variable selection of the standard linear model via the BIC, the predicted logged salaries from the stepwise-BIC linear model shows a mean square error (MSE) of 0.60 and adjusted R2 of 0.42, these terrible results motivate us to conduct further research for this problem. The logged salaries histogram shows multi-modality from Figure 1, it is acceptable to use the mixture regression model for predicting the logged salaries.
For comparison, we run multiple analyses that include three sets of starting parameters for each of models. The predicted results from the adaSGL-RMR model (BIC = 625) have a MSE of 0.11 and adjusted R2 of 0.90. The predicted results from the adaSGL-RMR model (BIC = 598) have a MSE of 0.05 and adjusted R2 of 0.95. The predicted results from the adaSGL-RMR model (BIC = 517) have a MSE of 0.04 and adjusted R2 of 0.96. See Table 2 for more details. These results suggest that the adaSGL-RMR model has the smallest MSE and explains the largest proportion of variance for the logged salaries from the 2018/19 NBA regular season. Moreover, from Figure 2, the predicted densities show a good characterization of the multi-modality in the logged salaries for the adaSGL-RMR models, with the stepwise-BIC linear model not being able to model this.
Figure 1. Histogram and density estimate for logged salaries
Table 2. Parameter estimates for NBA salary data.
Figure 2. Summary of densities for predicted and observed logged salaries.
In this paper, we propose a robust mixture regression model based on a Laplace distribution and consider the adaptive sparse group Lasso for variable selection. Its oracle properties are proved completely in Section 3. In addition, the numerical simulation and real data application show that our method has better performance in parameter estimation and variable selection than other methods. A limitation of this study is that we only consider the mixture regression model with K no-overlapping groups and ignore the case when there are some overlaps between different groups. In our future work, we will pay more attention to this problem.
 Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society: Series B, 39, 1-22.
 Jiang, W.X. and Tanner, M.A. (1999) Hierarchical Mixtures-of-Experts for Exponential Family Regression Models: Approximation and Maximum Likelihood Estimation. The Annals of Statistics, 27, 987-1011.
 Song, W.X., Yao, W.X. and Xing, Y.R. (2014) Robust Mixture Regression Model Fitting by Laplace Distribution. Computational Statistics and Data Analysis, 71, 128-137.
 Khalili, A. and Chen, J.H. (2007) Variable Selection in Finite Mixture of Regression Models. Journal of the American Statistical Association, 102, 1025-1038.
 Lloyd-Jones, L.R., Nguyen, H.D. and McLachlan, G.J. (2018) A Globally Convergent Algorithm for Lasso-Penalized Mixture of Linear Regression Models. Computational Statistics and Data Analysis, 119, 19-38.
 Fang, K.N., Wang, X.Y., Zhang, S.W., Zhu, J.P. and Ma, S.G. (2015) Bi-Level Variable Selection via Adaptive Sparse Group Lasso. Journal of Statistical Computation and Simulation, 85, 2750-2760.
 Fan, J.Q. and Li, R.Z. (2001) Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96, 1348-1360.
 Stone, M. (1974) Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society: Series B, 36, 111-133.
 Grün, B. and Leisch, F. (2007) Fitting Finite Mixtures of Generalized Linear Regressions in R. Computational Statistics and Data Analysis, 51, 5247-5252.