Variable selection in multivariate analysis is a critical step in regression, especially for high-dimensional datasets. It remains a challenge to identify a small fraction of essential predictors from thousands of variables, especially for small sample sizes. Variable selection by way of a sparse approximation of the parsimonious model can enhance the prediction and estimation accuracy by efficiently identifying the subset of essential predictors, reduce model complexity and improve model interpretability. This paper presents some unbiased, sparse and continuous methods for the judicious selection of important predictors, which allow easier interpretation, better prediction, and reduction in the complexity of the model.
This paper utilized a hyperspectral data, collected from vines at the leaf-level and the canopy-level, for a Riesling vineyard. The dataset was obtained by measuring the spectral reflectance, defined as the ratio of backscattered radiance from a surface and the incident radiance on that surface (scaled to 0% - 100%)  , directly over the leaves during the bloom period of growth. These in situ spectral measurements were coupled to the contemporaneous nutrient analysis of the petiole (leaf stem)  , as per Wine Grape Production Guide for Eastern North America  . The goal of that project was to develop vineyard nutrient models (nitrogen, potassium, phosphorous, magnesium, zinc, and boron), with wavelengths in the reflective regime (approximately 350 - 2500 nm) as predictor variables, toward rapid assessment vine nutrient status using remote sensors, such as cameras mounted on Unmanned Aerial Systems (UAS). Examples of similar past studies can be found in Soil Research  . Such an approach would enable growers to rapidly assess vineyard nutrient needs and apply remedial management interventions, e.g., tailored fertilization regimes. However, the approach is only useful if such models are accurate and precise, i.e., consistency and associated model robustness are critical. This specific grapevine dataset has n = 144 observations and p = 986 spectral bands, treated here as predictor variables; to reiterate, the objective was to identify those wavelengths or wavelength regions that are unbiased and precise predictors of a specific nutrient’s level in the plant. To achieve compatibility with 144 observations of predictors, six replicates of each of the response variable of 24 leaf-level spectral samples have been used.
This high dimensional dataset, with more number of variables than the sample size, suffers from the curse of dimensionality, ill-posedness  and multicollinearity as shown in Figure 1. Hence, a thorough analysis of such data requires modern regularization techniques involving simultaneous shrinkage and variable selection.
One popular family of feature selection methods for parametric models is based on the penalized (pseudo-) likelihood approach. These regularization paths for Generalized Linear Models via Coordinate Descent include the Lasso  , the Smoothly Clipped Absolute Deviation  , the Elastic Net  , the Minimax Concave Penalty  , Multi-Step Adaptive Elastic Net  , and related techniques. Fan & Lv  introduced Sure Independence Screening, where the sure screening is achieved by correlation learning. Since the spectral reflectance data have been measured along the continuum of wavelengths from 330 - 2510 nm, at a spectral sampling interval from 1.5 - 2.7 nm, it can be represented by a
Figure 1. Spectral reflectance curves are plotted as a function of wavelength. There is a total of 144 curves, each measured at 986 wavelengths for Riesling variety grapevines during the bloom growth period. These curves represent typical vegetation curves, with absorption features in the blue (450 nm) and red (650 nm) regions due to photosynthesis and high reflectance in the near-infrared (800 - 1400 nm region), due to internal cellular leaf structure. The noisy regions at approximately 1900 nm and 2300 nm (and onward) are due to atmospheric absorption features and are typically omitted from analyses  .
smooth curve belonging to an infinite dimensional space. Since predictors are non-periodic functional data  , we can use spline functions for approximation, which combines faster calculation of polynomials with significantly more flexibility. A fewer number of basis functions are required to achieve B-spline approximation.
Let us consider the traditional multiple linear regression model
where is an input matrix, is the corresponding response vector, is the regression coefficients vector, and is a vector of the residual errors with variance ( ) = . In order to remove the constant term from the regression model, let us standardize the
predictor variables, such that , for  .
Since the grapevine dataset has more predictor variables than the sample size (which causes multicollinearity), we will discuss the various modern regularization techniques, involving simultaneous shrinkage and variable selection, in subsequent sections.
To explore the various variable selection techniques, this paper is organized as follows: In Section 2, we explain the Elastic Net regularization approach with an emphasis on its clever combination of the traditional Ridge and Lasso methods; in Section 3 we introduce the Multi-Step Adaptive Elastic Net, which provides (at least in principle) an improvement over the basic elastic Net, via various modifications of the penalty function; Section 4 deals with the method known as Minimax Concave Penalty, which is yet another technique designed to address the estimation, inference and prediction inherent in complex datasets, like the grapevine data studied in this paper; in Section 5 we discuss various aspects of iterative Sure Independence Screening; Section 6 adopts a conceptually different approach, by using a functional approach to variable selection, including smoothing by basis representation and validation; Section 7 is dedicated to the comparative analysis of the performance of all of the above techniques with the goal of wavelength selection for the Riesling grapevine dataset toward nutrient estimation; and finally, in Section 8, we discuss the advantages and limitations of the various models mentioned above before concluding.
2. Elastic Net
Ridge regression, the oldest and earliest form of regularization, shrinks all coefficients of the predictors towards zero by a uniform (ℓ2 - norm) convex penalty to produce a unique solution. However, Ridge regression typically does not set the coefficients exactly to zero, unless l = ∞  . Indeed, for the scientific problem underlying the grapevine data, it is important to achieve both shrinkage and variable selection. Hence, Ridge regression is not suitable for the high dimensional grapevine dataset. The Ridge regression coefficients are defined as
where is a quadratic loss function (residual sum of squares), is the ith row of X, is the ℓ2 - norm penalty on , and is the tuning (penalty) parameter, which regulates the strength of the penalty.
Lasso, on the other hand, shrinks all coefficients by a constant value (ℓ1 - norm) and typically sets some of them to zero for some appropriately chosen l. It simultaneously achieves continuous shrinkage and automatic variable selection. However, when the multicollinearity is very high, Lasso tends to pick one of the predictors from the cluster in an arbitrary way and then shrink the others to zero  . The grapevine data are highly correlated; the arbitrariness of variable selection, therefore, will yield multiple solutions. Hence, analysis of the grapevine dataset using Lasso may not always yield a unique solution, as needed. The Lasso regression coefficients are defined as
where is the quadratic loss function, is the ith row of the matrix X, is the ℓ1 - norm penalty on ,
which induces sparsity in the solution, and is the tuning parameter.
Naïve Elastic Net (NEN) overcomes these limitations by combining of the Lasso (ℓ1-norm) and Ridge (ℓ2-norm) penalties  .
Let us consider two fixed non-negative tuning parameters: λ1 and λ2, such that the naïve elastic net criterion is
where , and
The NEN estimator is the minimizer of equation  :
Let us consider , then the elastic net estimator is
subject to for some t.
where is the naïve elastic net penalty and . For all , the penalty function is non-differentiable at 0 (like Lasso) but strictly convex (like ridge). Hence, by varying α, we can control the proportion of ℓ1-norm (α = 0) and the ℓ2-norm (α = 1) penalty. The amount of shrinkage to coefficient estimates is controlled by the parameter t ≥ 0.
Initially, the NEN computes the ridge regression coefficients for each fixed tuning parameter λ2 and then uses this coefficient value to acquire shrinkage along the Lasso coefficient paths. This technique of shrinkage increases the bias of the coefficients without substantial reduction in the variance, resulting in an overall increase of the prediction error. This leads to a doubled shrinkage and unnecessary extra bias, in comparison to Ridge regression or Lasso  . Elastic Net
can correct this double shrinkage by multiplying the NEN estimate by :
This type of transformation reverts to ridge shrinkage while retaining the variable selection property of the NEN. Thus, the elastic net is able to improve the prediction accuracy by achieving the automatic variable selection using ℓ1 penalty, while group selection and stabilization of the coefficient paths on random sampling are achieved by ℓ2 penalty. The sparsity of the elastic net increases monotonically from zero to the sparsity of the Lasso solution as α increases from 0 to 1, for a given parameter λ. Hence, the Elastic Net is a better solution for a dataset with a sample size significantly smaller than the number of highly correlated predictors  .
At times, it is advisable to include the entire group of correlated predictors in the model selection, rather than single variable from the group. In such cases, elastic net ensures that the highly correlated variables enter or exit the model together. The presence of the ℓ2-norm ensures a unique minimum by making the loss function strictly convex  .
3. Multi-Step Adaptive Elastic Net
Lasso and elastic-net approaches result in a substantial number of non-zero coefficients with asymptotically non-ignorable bias. The estimation bias of the Lasso can be reduced by choosing the weights such that the variables with larger coefficients have smaller weights than variables with smaller coefficients. To mitigate this bias, the adaptive Lasso uses the weighted penalty approach as given below:
where is a weighting parameter calculated from the data, ,
and γ is a positive constant. are the initial parameters, obtained by ridge regression or least squares.
This approach ensures that the adaptive lasso is able to accomplish more shrinkage, resulting in smaller coefficients. In other words, it executes a varying amount of shrinkage for different variables. Adaptive elastic net is a mixture of the adaptive Lasso and the elastic net. Before calculating the adaptive weights, we estimate the elastic-net , and use a positive constant γ. Using this information, we can calculate adaptive weights:
Now we can estimate the adaptive elastic net by the equation below:
The presence of the ℓ2-norm ensures that the adaptive elastic net is able to overcome the collinearity problem while retaining the consistency in variable selection and asymptotic Gaussian properties of the adaptive Lasso. The use of multi-step estimation achieves higher true positives (true zeroes are estimated as zeroes) for the variable selection by pursuing more iterative steps and using separate tuning parameters for each step. The estimates of multi-step adaptive elastic net (MSA-Enet) approach are given by:
where k = number of iterations (stages). For MSA-Enet, we use k ≥ 3. By considering k = 2, we can estimate the adaptive elastic net, and for k = 1, we obtain the normal elastic-net. We can obtain the values of and by using cross-validation.
4. Minimax Concave Penalty
Convex penalties fail to satisfy all three conditions of sparsity, continuity, and unbiasedness. Hence, they cannot produce true parsimonious models. To overcome these limitations, Fan & Li  ,  and Zhang  introduced new statistical modeling techniques for variable selection based on nonconvex penalties, called Smoothly Clipped Absolute Deviation (SCAD) and Minimax Concave Penalty (MCP), respectively. However, using non-convex penalties for sparsity will yield multiple local minima of the penalized residual sum of squares, without any knowledge about the best estimator. Hence, the authors of SCAD and MCP regression models have emphasized the oracle property of these nonconvex penalties. The Oracle property means selection of the correct subset of predictors and estimation of the non-zero parameters as if the information were known ahead of time, based on some previous investigations and experiences. These nonconvex penalties are initiated at the origin as the ℓ1 penalty (Lasso) until, |x| = λ, and then smoothly relax the penalization rate to zero as the absolute value of the coefficient increases, but differs in the way that the transition takes place. The MCP relaxes the penalization rate immediately, whereas, for the SCAD, the penalization rate remains flat for a while, before decreasing.
Zhang  defined MCP on [0, ∞) by
for γ > 0 and λ > 0. Equation (12) clearly shows that MCP initially applies the ℓ1 penalty (Lasso), but continuously relaxes that penalization rate until, when t > lγ. At this stage, the rate of penalization drops to 0.
MCP minimizes the maximum concavity
subject to the following unbiasedness and features selection:
The MCP achieves . A higher value of regularization parameter 𝛾 ensures reduction in unbiasedness and increase in concavity. According to  , the penalized regression problem using the MCP function is given as:
where . Estimation of the coefficients using MCP depends on the selection of the parameters γ and λ, obtained through cross-validation. For each penalty value of λ > 0, MCP offers a continuum of penalties starting with the ℓ1-norm at γ → ∞ and the ℓ0-norm as γ → 0+  . Selection of γ determines the sparsity of the model.
The convexity of penalties guarantees that the coordinate descent converges to a unique global minimum and is continuous with respect to λ. Convexity ensures good starting values, which in turn, reduces the number of iterations. However, when convexity fails to exist, then may not necessarily be continuous. In other words, a slight variation in the data may significantly change the coefficient estimate. The estimates obtained using non-convex penalty generally have a large variance. Although the unbiasedness and variable selection preclude convex penalties, the MCP provides the sparse convexity to the broadest extent by minimizing the maximum concavity.
For the high-dimensional grapevine dataset, global convexity is neither possible nor relevant. However, the objective function of the grapevine dataset is convex in the local region. The parsimonious solutions of this objective function have smooth coefficient paths with stable coefficients. The tuning parameter gamma ( ) for the MCP controls how fast the penalization rate goes to zero.
5. Sure Independence Screening (SIS)
The coordinate descent algorithm (penalized likelihood) methods fail to conform to the concurrent expectations of computational expediency, statistical accuracy, and algorithmic stability in the extremely high dimensional dataset. In order to overcome this constraint, Fan & Lv  proposed the concept of the sure screening method, based on a component-wise regression that tackles the challenges above. Variable selection through coefficient estimates generally overfits the model; hence, authors utilized the marginal correlations, instead of regression estimates, in order to address the problem of the dimensionality reduction of ultra high dimensional datasets. Since screening does not require inversion of a matrix, this method seems computationally attractive. This correlation screening, called Sure Independence Screening (SIS), relies on the intuition that the predictors are independent and normally distributed. In other words, each variable is independently used as a predictor to decide its usefulness in predicting the response variable.
According to Saldana & Feng  , SIS is a two-stage procedure. It first removes the variables with weak marginal correlation with the response, thus achieving dimensionality reduction p below the sample size n. Then it accomplishes variable selection and parameter estimation simultaneously through a lower dimensional penalized least squares approach, like SCAD or Lasso. Under certain regularity conditions, the independent, sure screening process keeps all of the relevant predictors in the model with a probability approaching one.
Let X be a matrix with dimension n ´ p and be a p-vector of marginal correlations of predictors with the response variable, acquired by component-wise regression, such that
We standardize the matrix X column-wise and rescale vector by the standard deviation of the response.
Let us sort p component-wise magnitudes of the vector in a decreasing order and define a sub-model for any ,
where signifies the integer part of . In this way, we can reduce the dimension of the full model to a sub-model with size . Hu & Lin  suggested screening of variables by ranking the importance of each predictor according to its marginal Pearson correlation with the response variable. SIS uses this marginal information of correlation to achieve variable selection by removing the predictors, which have weak correlation with the response variable  . Component-wise regression is a simple method of dimensionality reduction below the sample size; however, the method may be affected by multicollinearity. Due to multicollinearity, the sample marginal screening can remove concealed essential predictors, which have a significant influence on a response variable but are weakly marginal correlated with it. In other words, some highly correlated unimportant predictors are selected instead of significant variables, which are relatively feebly linked to the response variable. At times, SIS may not pick up significant predictors, which are marginally uncorrelated but jointly correlated with the response variable. Iterative sure independence screening not only overcomes these limitations by using SCAD but also improves variable selection and parameter estimation via penalized likelihood estimation. It makes use of the shared predictors’ information while retaining computational expediency and stability. Since the predictors of the grapevine dataset are not independent, even Iterative SIS, selects a fewer number of predictors than desired.
6. Functional Data Analysis
Data collection technology has been recently developed to measure observations densely sampled over wavelength, space, time, and other continua  . In such cases, the random variables can assume values in an infinite dimensional space, even though only finite numbers of observations are available and it is represented by a set of curves  . Functional data, in turn, are defined as discrete observations of a phenomenon represented by smooth curves. It reflects the dependence structure between neighboring points so that the phenomenon can be evaluated at any point in time. These observed curves and the statistical methods for its analysis are termed functional data and functional data analysis, respectively  .
6.1. Functional Regression Model
When we consider a linear regression, with the response variable y and the predictor xij being a scalar, then the model takes the form:
This linear model fails to capture the smoothness of the X predictor variables with respect to the wavelength. However, if we replace at least one vector of predictor variable observations in the linear equation by a functional predictor xi(t), we obtain a model consisting of an intercept term along with a single functional predictor variable  .
Let be a set of times, then we can discretize each of the n functional predictors xi(t) on this set. Now fit the model:
If the refinement of the selected time is continued, then the summation will approach an integral equation, and we will get a functional linear regression model for the scalar response:
where the functional regression tries to establish a relationship between a scalar outcome yi and random functions xi(t)  .
Here the constant is the intercept term that adjusts for the origin of the response variable. The parameter β is in the infinitely dimensional space of ℓ2 functions (the Hilbert space of all square integral functions over a particular interval)  .
6.2. Smoothing by Basis Representation
When a function belongs to ℓ2 space, it can be represented by a basis of known functions  . B-spline is one such basis representation used to calculate the functional regression between a functional predictor (spectral reflectance) X(t) and the scalar response. It uses a fixed truncated basis expansion with K known basis elements:
The smoothing (or hat) matrix H is symmetric and square.
The degrees of freedom (DF) for functional fit is given by
moreover, the associated degrees of freedom for error is n - DF.
In spline smoothing, the mean squared error (MSE) is a method of assessing the quality of the estimate. We can reduce the MSE by foregoing some bias, which will lower sampling variance thereby smoothing the estimated curve. Since the estimates are expected to vary slightly from one wavelength to another, the process is akin to appropriating information from neighboring data. This expresses our confidence in the consistency of the function x. The sharing of information increases the bias but improves the stability of the estimated curve  . The number of basis functions to calculate the predictive ability of functional data analysis can be selected based on the minimum mean MSE.
The functional approach of smoothing data performs well only when the number K of basis functions is significantly small as compared to the number of observations. Higher values of K will tend to overfit or undersmooth the data  .
7. Comparative Study for Wavelength Selection
A proper choice of selection methods, applied under appropriate conditions, helps to build consistent parsimonious models and estimate coefficients simultaneously for better prediction accuracy.
Here, we compare four regularized and one functional regression methods: elastic net, multi-step adaptive elastic net, minimax concave penalty, sure independence screening, and functional data analysis based on their predictive performance. To select the significant variables using regularized path elastic-net, multi-step adaptive elastic net, minimax concave penalty, and sure independence screening, R packages glmnet, msaenet, ncvreg, and SIS, respectively, have been used. Thereafter the predictive performance has been enhanced using stepwise regression. Functional regression is performed using R package “fda.usc”.
Figure 2 and Figure 3 demonstrate that the best values for the adjusted and predicted R-squared metrics, for all the six nutrients of the highly correlated grapevine data, are achieved using the generalized linear model with elastic net penalty (penalized maximum likelihood). Multi-step adaptive elastic net, which applies data-driven weights to the ℓ1 penalty of the elastic net, reduces the values of adjusted and predicted R-squared, thereby contradicting Xiao & Xu  . The other three methods have mixed results for the various nutrients of the
Figure 2. The adjusted R-squared value of the six nutrients measured using Elastic Net, Multi-Step Adaptive Elastic Net, Minimax Concave Penalty, iterative Sure Independence Screening, and Functional Data Analysis.
Figure 3. The predicted R-squared value of the six nutrients measured using Elastic Net, Multi-Step Adaptive Elastic Net, Minimax Concave Penalty, iterative Sure Independence Screening, and Functional Data Analysis.
grapevine data. Selection of lambda is obtained using 10-fold cross-validation, based on the mean squared error criterion.
Since these grapevine datasets are high dimensional with multicollinearity, statistical inference is possible only by dimensionality reduction through sparse representation. A parsimonious model of significant predictors is selected by reducing the coefficients of unimportant predictors to zero, which improves the estimation accuracy and enhances the model interpretability. The first four models deal with linear regression, while the fifth one uses functional approach.
Elastic Net averages the highly correlated wavelengths, before incorporating the averaged wavelength into the model. The predictive ability of elastic net for this high dimensional grapevine dataset with high multicollinearity is the best among all the methods discussed above  .
It is worth mentioning that the elastic net is also practically desirable because it provides interpretable output in the form of the solution path plot, which helps to visualize the variable selection. Figure 4, below shows an example of the Riesling variety when the response value is Nitrogen.
As an example, analysis of the grapevine dataset reveals that the following wavelengths are important for predicting the Nitrogen nutrient level: 334.3, 347.8, 571.3, 684.6, 756.9, 1434.4, 1826.4, 1858.4, 1872.6, 1893.8, 1903.8, 1906.6, 1912.2, 1928.9, 1934.5, 1942.8, 1956.6, 1962.1, 1994.8, 2355.2, 2371, 2386.7, 2393.3, 2419.7, 2426.2, 2430.5, 2439.1, 2483.6 nm. These values are in accordance to physiological expectations, although future studies could explore specific
Figure 4. Model Coefficient Path using Elastic Net for Nitrogen. This demonstrates how the coefficients of the nutrients enter the model (become non-zero) as lambda changes. Most of the variables have coefficients close to zero, which indicates high collinearity.
absorption features in detail. For instance, Elvidge & Chen  used reflectance spectra from a pinyon pine canopy and identified 674 nm as the most pronounced chlorophyll absorption feature, which is close to 684.6 nm in the list above. Chlorophyll generally is known to have a close relationship to Nitrogen content  . It furthermore follows that, given Nitrogen’s close relationship to chlorophyll content, Nitrogen predictive models would require a number of wavelengths from the near-infrared region  . We thus concluded that our wavelength selection for Nitrogen, as an example, is valid from a vegetation physiological perspective.
Application of a data-driven weighted approach to the ℓ1-penalty for varying amounts of shrinkage at different variables reduces the bias and variance inflation factor. However, in the bargain, the coefficient of a large number of significant variables is reduced to zero, resulting in the poor predictive ability for the multi-step adaptive elastic net.
Convergence and estimation of the coefficient using MCP depend upon the tuning parameters gamma (γ) and lambda (λ). Hence, the choice of tuning parameter fails to capture all of the significant predictors of the highly correlated grapevine dataset, resulting in poor predictive ability.
Sure Independence Screening, based on component-wise regression or equivalently correlation learning, is computationally attractive because this approach does not require matrix inversion. However, SIS is known to select unimportant predictors, which are highly correlated with the important predictors, instead of significant predictors, which are weakly related to the response. Hence, the predictive performance of SIS is adversely affected in the presence of multicollinearity.
The functional approach of smoothing data performs well only when the number K of basis functions is significantly small as compared to the number of observations. The presence of multicollinearity fails to reduce the number of basis functions significantly, based on minimum MSE, which negatively affects the predictive ability of grapevine dataset based on functional data analysis.
There has been a continuous endeavor to enhance the predictive ability of the high dimensional data by refining the coefficient estimates. These modern variable selection techniques generate a sparse model, based on the assumption that the predictor variables are independent. These models yield extremely good predictive accuracy when the assumption of independence is satisfied. However, for a dataset like these hyperspectral reflectance grapevine data, which is highly correlated with a large number of predictors (wavelengths), clustered together, only Elastic Net has the ability to select the groups of correlated variables. This outcome is especially critical to the burgeoning field of precision agriculture, which is making increasing use of such hyperspectral imaging datasets but cannot reach a large sample size through traditional field work (time and monetary constraints). These data are also highly correlated (~97%). In all such cases, the predictive ability of Elastic Net is likely to outperform the other modern variable selection techniques.
We are grateful to Dr. Justine Vanden Heuvel (Cornell University) for her expertise in vineyard physiology, as well as the field teams from Rochester Institute of Technology and Cornell University for their help in collecting field data.