1. Introduction
It is well understood that not all observations in a data set play an equal role when fitting a regression model. We occasionally find that a single observation, or a small subset of the data, exerts a disproportionate influence on the fitted regression model; that is, parameter estimates or predictions may depend more on the influential subset than on the majority of the data. Belsley et al. [1] defined an influential observation as one which, either individually or together with several other observations, has a demonstrably larger impact on the calculated values of various estimates than is the case for most of the other observations. An influential observation in either the dependent or the independent variables can arise from a data error or some other problem; for example, influential data points in the dependent variable can arise from skewness in the independent variables or from differences in the data-generating process for a small subset of the sample. Obviously, outliers, which are observations in a data set that appear to be inconsistent with the remainder of the data [2], need not be influential observations affecting the regression equation [3]. Andrews and Pregibon [4] highlighted the need to find the outliers that matter. They stated that not all outliers need be harmful in the sense of having undue influence on, for instance, the estimation of the parameters of the regression model. If not all outliers matter, examining residuals alone might not lead to the detection of influential observations. Thus, other ways of detecting influential observations are needed.
Regression diagnostics comprise a collection of methods for identifying influential points and multicollinearity [1]. These include methods of exploratory data analysis for influential points and for identifying violations of the assumptions of least squares. When the assumption of the Ordinary Least Squares (OLS) method that the explanatory variables are not linearly correlated is violated, the result is a multicollinearity problem, which should be controlled before attempting to measure influence [1]. One of the most popular methods of controlling multicollinearity is Ridge Regression (RR), suggested by Hoerl and Kennard [5]. The idea in RR is to add a small positive number (k > 0) to the diagonal elements of the matrix $X'X$ in order to obtain the ridge regression estimator

$$\hat{\beta}_R = (X'X + kI)^{-1}X'y \qquad (1)$$

Though the estimator obtained is biased, it can yield a smaller Mean Squared Error (MSE) than the OLS estimator. If k = 0, $\hat{\beta}_R$ becomes the unbiased OLS estimator $\hat{\beta} = (X'X)^{-1}X'y$. The choice of the ridge parameter k has always been a problem in using RR to solve multicollinearity, hence methods of estimating the value of k have been suggested by several authors, among them Hoerl and Kennard [5], Hoerl et al. [6], Lawless and Wang [7], Nomura [8], Khalaf and Shukur [9], Dorugade [10], Al-Hassan [11], Dorugade and Kashid [12], Saleh and Kibria [13], Kibria [14], Zang and Ibrahim [15], Alkhamisi et al. [16], Al-Hassan [17], Muniz and Kibria [18], Khalaf and Iguernane [19], and Uzuke et al. [20].
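As a minimal sketch of Equation (1), assuming simulated data in place of the paper's CBN series, the ridge estimator can be computed directly in R; the objects X and y defined here are reused in the later sketches.

set.seed(1)                                    # simulated illustration only
n <- 30; p <- 3
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
y <- drop(X %*% c(1, 2, 3) + rnorm(n))

ridge_beta <- function(X, y, k) {
  # Equation (1): (X'X + kI)^{-1} X'y; k = 0 recovers OLS
  solve(t(X) %*% X + k * diag(ncol(X)), t(X) %*% y)
}

ridge_beta(X, y, k = 0)                        # OLS estimate
ridge_beta(X, y, k = 0.5)                      # biased, but lower-variance, ridge estimate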
Several diagnostic methods have been developed to detect influential observations. Cook [21] introduced Cook's distance ($D_i$), which is based on deleting the observations one at a time and measuring the effect of each deletion on the fitted linear regression model. Other measures developed from the idea of Cook's distance include the modified Cook's distance, DFFITS, Hadi's measure, Pena's statistic, DFBETAS, and COVRATIO.
Multicollinearity and influential observations therefore affect regression analysis and its estimates remarkably. In using Ridge Regression to mitigate the multicollinearity problem, there is always the question of which method to use to estimate the ridge parameter (k) so that the reduction in variance exceeds the increase in bias. Furthermore, one may want to know whether multicollinearity affects the identification of influential observations.
2. Methodology
The influence of an observation is measured by the effect it produces on the fit when it is deleted in the fitting process. This deletion is done one point at a time. Let $\hat{\beta}_{(i)}$ denote the regression coefficients obtained when the ith observation is deleted. Similarly, let $\hat{y}_{(i)}$ and $\hat{\sigma}^2_{(i)}$ be the predicted values and the residual mean square, respectively, when the ith observation is dropped. Note that

$$\hat{y}_{m(i)} = x_m'\hat{\beta}_{(i)} \qquad (2)$$

is the fitted value for observation m when the fitted equation is obtained with the ith observation deleted. Influence measures look at the differences produced in quantities such as $\hat{\beta} - \hat{\beta}_{(i)}$ or $\hat{y} - \hat{y}_{(i)}$. This work adopted the following influence measures:
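The deletion operation behind Equation (2) can be sketched directly, reusing the simulated X and y above: refit the model without observation i, then predict the deleted point.

dat <- data.frame(y, X)                        # columns y, X1, X2, X3
i <- 5                                         # delete the 5th observation
fit_i <- lm(y ~ ., data = dat[-i, ])           # coefficients beta_(i)
yhat_ii <- predict(fit_i, newdata = dat[i, ])  # Equation (2): y-hat_{i(i)}
fitted(lm(y ~ ., data = dat))[i] - yhat_ii     # change produced by the deletion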
1) Cook’s Distance
Cook [21] proposed this measure and it is widely used. Cook's distance measures the difference between the fitted values obtained from the full data and the fitted values obtained by deleting the ith observation. Cook's distance is defined as

$$D_i = \frac{(\hat{\beta} - \hat{\beta}_{(i)})'X'X(\hat{\beta} - \hat{\beta}_{(i)})}{(p+1)\,\hat{\sigma}^2} \qquad (3)$$

which can also be expressed as

$$D_i = \frac{r_i^2}{p+1}\cdot\frac{h_{ii}}{1-h_{ii}} \qquad (4)$$

Thus, Cook's distance is a multiplicative function of two quantities. The first term in Equation (4) involves the square of the standardized residual, given as $r_i = \dfrac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}}$, and the second term is called the potential function, where $h_{ii} = x_i'(X'X)^{-1}x_i$ is the leverage of the ith observation.
If a point is influential, its deletion causes large changes and the value of $D_i$ will be large. Therefore, a large value of $D_i$ indicates that the point is influential. It has also been suggested that points with $D_i$ greater than the 50% point of the F distribution with (p + 1) and (n − p − 1) degrees of freedom be classified as influential points.
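A short sketch of Equation (4) and the F-based cutoff, reusing dat from the deletion sketch and cross-checked against base R's cooks.distance():

fit <- lm(y ~ ., data = dat)                   # full-data fit, reused below
n <- nrow(dat); p <- ncol(dat) - 1             # p regressors, p + 1 coefficients
h <- hatvalues(fit)                            # leverages h_ii
r <- rstandard(fit)                            # standardized residuals r_i
D <- (r^2 / (p + 1)) * (h / (1 - h))           # Equation (4)
all.equal(unname(D), unname(cooks.distance(fit)))  # TRUE
which(D > qf(0.5, p + 1, n - p - 1))           # points beyond the 50% F cutoff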
2) Welsch and Kuh Measure
Welsch and Kuh [22] developed a measure similar to Cook's distance, named DFFITS, defined as

$$DFFITS_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\hat{\sigma}_{(i)}\sqrt{h_{ii}}} \qquad (5)$$

It is the scaled difference between the ith fitted value obtained from the full data and the ith fitted value obtained by deleting the ith observation. $DFFITS_i$ can also be written as

$$DFFITS_i = r_i^*\sqrt{\frac{h_{ii}}{1-h_{ii}}} \qquad (6)$$

where $r_i^*$ is the externally studentized (R-student) residual defined as $r_i^* = \dfrac{e_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}}$.
Points with $|DFFITS_i| > 2\sqrt{(p+1)/n}$ are usually classified as influential points.
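Equation (6) and this flag can be checked against base R's dffits(), reusing fit, h, n and p from the Cook's distance sketch:

rs <- rstudent(fit)                            # externally studentized residuals
DFFITS <- rs * sqrt(h / (1 - h))               # Equation (6)
all.equal(unname(DFFITS), unname(dffits(fit))) # TRUE
which(abs(DFFITS) > 2 * sqrt((p + 1) / n))     # flagged points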
3) Hadi’s Influence Measure
Hadi [23] proposed a measure of the influence of the ith observation based on the fact that influential observations are outliers in the response variable, in the predictors, or in both. Accordingly, the influence of the ith observation can be measured by

$$H_i = \frac{h_{ii}}{1-h_{ii}} + \frac{p+1}{1-h_{ii}}\cdot\frac{d_i^2}{1-d_i^2} \qquad (7)$$

where $d_i = e_i/\sqrt{SSE}$ is the normalized residual. $H_i$ is an additive function. The first term of the equation is the potential function, which measures outlyingness in the X-space, and the second term is a function of the residual, which measures outlyingness in the response variable. Observations with large $H_i$ are influential in the response and/or the predictor variables. Although the measure does not focus on a specific regression result, it can be thought of as an overall general measure of influence, depicting observations that are influential on at least one regression result.
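Base R has no built-in Hadi measure, so Equation (7) is computed directly in this sketch, reusing fit, h and p from the sketches above:

e <- residuals(fit)
d <- e / sqrt(sum(e^2))                        # normalized residuals d_i
Hi <- h / (1 - h) + ((p + 1) / (1 - h)) * d^2 / (1 - d^2)  # Equation (7)
head(sort(Hi, decreasing = TRUE), 3)           # the largest values are most suspect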
4) DFBETAS [1]
DFBETAS measures the difference in each parameter estimate with and without the influential data point. It is an influence measure used to ascertain which observations influence a specific regression coefficient:

$$DFBETAS_{j(i)} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\sqrt{\hat{\sigma}^2_{(i)}\,C_{jj}}} \qquad (8)$$

where $\hat{\beta}_{j(i)}$ denotes the jth regression coefficient obtained when the ith observation is deleted in the fitting process, $\hat{\beta}_j$ is the jth coefficient from the full data, and $C_{jj}$ is the jth diagonal element of $(X'X)^{-1}$.
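Equation (8) is available in base R as dfbetas(); rows index observations and columns coefficients, and $|DFBETAS| > 2/\sqrt{n}$ is a commonly used flag (same fit as above):

DFB <- dfbetas(fit)                            # Equation (8) for every j and i
which(abs(DFB) > 2 / sqrt(n), arr.ind = TRUE)  # which point moves which coefficient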
5) Kuh and Welsch Ratio (COVRATIO)
The COVRATIO statistic measures the change in the determinant of the covariance matrix of the estimates when the ith observation is deleted. This influence measure is given as

$$COVRATIO_i = \frac{\det\left[\hat{\sigma}^2_{(i)}\left(X_{(i)}'X_{(i)}\right)^{-1}\right]}{\det\left[\hat{\sigma}^2\left(X'X\right)^{-1}\right]} \qquad (9)$$

which can also be expressed as

$$COVRATIO_i = \left(\frac{\hat{\sigma}^2_{(i)}}{\hat{\sigma}^2}\right)^{p'}\frac{1}{1-h_{ii}} \qquad (10)$$

where n is the sample size, p' = p + 1 is the number of estimated coefficients, and $h_{ii}$ is the ith diagonal element of the hat matrix.
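Equation (10) is available in base R as covratio(); values far from 1, commonly $|COVRATIO_i - 1| > 3(p+1)/n$, flag influence on estimation precision (same fit as above):

CR <- covratio(fit)                            # Equation (10)
which(abs(CR - 1) > 3 * (p + 1) / n)           # flagged points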
The ridge parameter estimators selected to control multicollinearity (their formulas are given in Appendix I; a short R sketch of estimator (a) follows this list) are:

a) Hoerl and Kennard [5]: $\hat{k}_{HK} = \hat{\sigma}^2/\hat{\alpha}^2_{\max}$

b) Kibria [14]

c) Alkhamisi et al. [16]

d)-g) four estimators of Muniz and Kibria [18], each a function of $m_i = \sqrt{\hat{\sigma}^2/\hat{\alpha}_i^2}$

h) Dorugade [10]

i) Uzuke et al. [20]

j) OLS (k = 0), the uncontrolled baseline
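As a minimal sketch of estimator (a), assuming unit-variance rather than unit-length scaling (the two differ only by a constant factor), Hoerl and Kennard's k can be computed from the canonical coefficients of the simulated data above; the other estimators replace the maximum with other functions of the $\hat{\alpha}_i$:

Z <- scale(X); yz <- drop(scale(y))            # standardized data
fz <- lm(yz ~ Z - 1)                           # correlation-form regression
sig2 <- summary(fz)$sigma^2                    # residual mean square
evec <- eigen(cor(X))$vectors                  # eigenvectors of the correlation matrix
alpha <- drop(t(evec) %*% coef(fz))            # canonical coefficients alpha = D'gamma
k_HK <- sig2 / max(alpha^2)                    # Hoerl-Kennard ridge parameter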
3. Illustration
We use the Nigerian economic indicator (1980-2010) data from the Central Bank of Nigeria (CBN) Statistical Bulletin 2010. The data consist of Gross Domestic Product as the dependent variable (y) and ten (10) independent variables, namely Money Supply (x1), Credit to Private Sector (x2), Exchange Rate (x3), External Reserve (x4), Agricultural Loan (x5), Foreign Reserve (x6), Oil Import (x7), Non-oil Export (x8), Oil Export (x9), and Non-oil Export (x10), shown in Appendix III.
Table 1 shows that multicollinearity is present in the data, since most of the independent variables have VIF > 10, eigenvalues close to zero (0), tolerance T < 0.1 and condition number CN > 5. The correlation matrix of the data set also shows the presence of multicollinearity.
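A sketch of the Table 1 diagnostics, run here on the simulated data from the earlier sketches in place of the CBN series; vif() is assumed to come from the car package:

library(car)                                   # for vif()
vif(fit)                                       # VIF > 10 flags collinearity
1 / vif(fit)                                   # tolerance T < 0.1 flags it too
lambda <- eigen(cor(X))$values                 # eigenvalues near zero flag it
sqrt(max(lambda) / min(lambda))                # condition number CN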
Identification of Influential Observations
Using five different influence measures (Cook's distance, DFFITS, Hadi's influence measure, DFBETAS and COVRATIO), influential observations in the real data are identified using the criteria of Table 2, both when multicollinearity is not controlled (OLS: k = 0) and when it is controlled using the selected ridge parameter estimators. The values of the measure criteria are presented in Table 2.
The influential observations identified by the five influence measures in the presence of multicollinearity, and when it is controlled using the selected ridge parameters (k), are presented in Table 3.
Table 1. Result of test for multicollinearity.
Table 2. Influential measures, calculated measure criteria and values obtained.
Table 3. Influential observations identified.
Comparing the calculated measures with the criterion values of Table 2, any observation whose calculated influence measure exceeds the criterion value is identified as an influential observation or data point. Cook's distance and Hadi's influence measure performed alike: they failed to identify influential data points when ridge estimators were used to control multicollinearity. DFFITS and COVRATIO identified the single observation 25 both under OLS and when multicollinearity was controlled, while DFBETAS also identified data point 29.
4. Summary and Conclusion
The ridge estimator affects which influential observations are identified. Cook's distance and Hadi's influence measure identified several influential data points in the presence of multicollinearity but failed to identify any data point once the multicollinear effect had been controlled. DFFITS, DFBETAS and COVRATIO identified the same single data point both in the presence of multicollinearity and when it had been controlled. Cook's distance and Hadi's influence measure are very sensitive in the presence of multicollinearity, which made them identify several influential data points, but they are less sensitive when multicollinearity is controlled, failing to identify any data point. DFFITS, DFBETAS and COVRATIO perform better and should be used when multicollinearity is controlled.
Appendix I
Algorithm for the R Programme
The model is

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, \dots, n$$

Using the unit length scaling shown below:

$$y_i^0 = \frac{y_i - \bar{y}}{S_{yy}^{1/2}}, \qquad w_{ij} = \frac{x_{ij} - \bar{x}_j}{S_{jj}^{1/2}},$$

where $\bar{y}$ is the mean of Y, $\bar{x}_j$ is the mean of $x_j$, and

$$S_{jj} = \sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2 \quad \text{and} \quad S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2,$$

such that $W'W$ is the correlation matrix of the regressors, we obtain the following model:

$$y^0 = W\gamma + \varepsilon$$
Obtain $A = W'W$
Eigenvalues of A: $t_j$
Eigenvectors of A: $D$
Confirm that $D'D = I$
Confirm that $D'AD = \mathrm{diag}(t_1, \dots, t_p)$
Obtain $\hat{\gamma} = (W'W)^{-1}W'y^0$
Obtain $\hat{\alpha} = D'\hat{\gamma}$
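A sketch of the scaling and eigen-decomposition steps above, reusing the simulated X and y from the earlier sketches:

uls <- function(v) (v - mean(v)) / sqrt(sum((v - mean(v))^2))  # unit-length scaling
W <- apply(X, 2, uls); y0 <- uls(y)
A <- crossprod(W)                              # A = W'W, the correlation matrix
es <- eigen(A)                                 # t_j = es$values, D = es$vectors
all.equal(crossprod(es$vectors), diag(ncol(W)))               # D'D = I
all.equal(t(es$vectors) %*% A %*% es$vectors, diag(es$values),
          check.attributes = FALSE)            # D'AD = diag(t_j)
gam <- solve(A, crossprod(W, y0))              # OLS fit of the scaled model
alpha <- drop(t(es$vectors) %*% gam)           # canonical coefficients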
Methods of estimating ridge parameter k
1) Hoerl and Kennard (1970): $\hat{k}_{HK} = \dfrac{\hat{\sigma}^2}{\hat{\alpha}^2_{\max}}$, where $\hat{\sigma}^2$ is the residual mean square estimate of $\sigma^2$ and $\hat{\alpha}_i$ is the ith element of $\hat{\alpha} = D'\hat{\gamma}$, which is an unbiased estimator of $\alpha$, where D is the matrix of eigenvectors of $W'W$
2) Kibria (2003) [14]
3) Alkhamisi et al. (2006) [16], where $t_i$ is the ith eigenvalue of the matrix $W'W$
4)-7) Muniz and Kibria [18], four estimators, each a function of $m_i = \sqrt{\hat{\sigma}^2/\hat{\alpha}_i^2}$
8) Dorugade [10]
9) Uzuke et al. [20]
10) OLS: k = 0
Methods of detecting influential observations

Method 1 (Cook's distance)

$$D_i = \frac{r_i^2}{p+1}\cdot\frac{h_{ii}}{1-h_{ii}}$$

The criterion is $D_i > F_{0.5}(p+1,\, n-p-1)$, where

$$r_i = \frac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}} \quad \text{and} \quad h_{ii} = x_i'(X'X)^{-1}x_i$$

Method 2 (DFFITS)

$$DFFITS_i = r_i^*\sqrt{\frac{h_{ii}}{1-h_{ii}}}$$

The criterion is $|DFFITS_i| > 2\sqrt{(p+1)/n}$, where $r_i^*$ is the R-student residual defined as $r_i^* = \dfrac{e_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}}$.

Method 3 (Hadi's measure)

$$H_i = \frac{h_{ii}}{1-h_{ii}} + \frac{p+1}{1-h_{ii}}\cdot\frac{d_i^2}{1-d_i^2}$$

where $d_i = e_i/\sqrt{SSE}$ is called the normalized residual.

Method 4 (DFBETAS)

$$DFBETAS_{j(i)} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\sqrt{\hat{\sigma}^2_{(i)}\,C_{jj}}}$$

The criterion is $|DFBETAS_{j(i)}| > 2/\sqrt{n}$.

Method 5 (COVRATIO)

$$COVRATIO_i = \left(\frac{\hat{\sigma}^2_{(i)}}{\hat{\sigma}^2}\right)^{p+1}\frac{1}{1-h_{ii}}$$

The criterion is $|COVRATIO_i - 1| > 3(p+1)/n$, where $\hat{\sigma}^2_{(i)}$ is the residual mean square with the ith observation deleted, $C_{jj}$ is the jth diagonal element of $(X'X)^{-1}$, and $h_{ii}$ is the ith leverage.
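For a compact cross-check of Methods 1, 2, 4 and 5, base R's influence.measures() returns DFBETAS, DFFITS, COVRATIO, Cook's distance and the leverages in one call, with its own flagging rules (same fit as in the earlier sketches):

im <- influence.measures(fit)
head(round(im$infmat, 3))                      # the computed statistics
which(apply(im$is.inf, 1, any))                # rows flagged by any criterion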
Appendix II
R Codes for Detecting Influential Observation for Different k Values
# Assumes: rr is the data frame with response V1 and the ten regressors, and
# k is the vector of the nine ridge parameters obtained in Appendix I.
library(lmridge)                             # provides lmridge() and hatr()
n <- 30; p <- 10                             # observations and regressors
xx <- as.matrix(rr[, -1])                    # design matrix without the response

for (j in 1:9) {                             # one pass per ridge parameter k[j]
  fit <- lmridge(V1 ~ ., rr, K = k[j])
  h   <- matrix(hatr(fit), n, n)             # ridge hat matrix
  r   <- residuals(fit)                      # full-data residuals
  cc  <- coef(fit)                           # full-data coefficients
  sig <- sum(r^2) / (n - p - 1)              # full-data residual mean square
  ssr <- sum(r^2)                            # residual sum of squares

  C <- DF9 <- H <- DFB <- COV <- NULL
  for (i in 1:n) {                           # delete one observation at a time
    b1    <- coefficients(lm(V1 ~ ., rr[-i, ]))
    r1    <- residuals(lm(V1 ~ ., rr[-i, ]))
    sig1  <- sum(r1^2) / (n - 1 - p - 1)     # deleted residual mean square
    num   <- cc[3] - b1[3]                   # change in the 3rd coefficient
    hh    <- solve(t(xx[-i, ]) %*% xx[-i, ])
    denom <- sqrt(sig1 * hh[3, 3])
    d2    <- r[i]^2 / ssr                    # squared normalized residual

    C   <- rbind(C, (r[i]^2 / (sig * (1 - h[i, i])) / (p + 1)) *
                    (h[i, i] / (1 - h[i, i])))                  # Cook's distance
    DF9 <- rbind(DF9, r[i] / sqrt(sig1 * (1 - h[i, i])) *
                      sqrt(h[i, i] / (1 - h[i, i])))            # DFFITS
    H   <- rbind(H, h[i, i] / (1 - h[i, i]) +
                    ((p + 1) / (1 - h[i, i])) * d2 / (1 - d2))  # Hadi's measure
    DFB <- rbind(DFB, num / denom)                              # DFBETAS (3rd coef)
    COV <- rbind(COV, (sig1 / sig)^(p + 1) / (1 - h[i, i]))     # COVRATIO, Eq. (10)
  }
}
Appendix III
Table A1. Nigerian economic indicator (1980-2010) data.
[1] Belsley, D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley and Sons, New York.
[2] Johnson, R.A. and Wichern, D.W. (2002) Applied Multivariate Statistical Analysis. Pearson Education, Delhi.
[3] Mickey, M.R., Dunn, O.J. and Clark, V. (1967) Note on the Use of Stepwise Regression in Detecting Outliers. Computers and Biomedical Research, 1, 105-111.
https://doi.org/10.1016/0010-4809(67)90009-2
[4] Andrews, D.F. and Pregibon, D. (1978) Finding the Outliers That Matter. Journal of the Royal Statistical Society, Series B, 40, 85-93.
https://doi.org/10.1111/j.2517-6161.1978.tb01652.x
[5] Hoerl, A.E. and Kennard, R.W. (1970) Ridge Regression: Biased Estimation for Non-Orthogonal Problems. Technometrics, 12, 55-67.
https://doi.org/10.1080/00401706.1970.10488634
[6] Hoerl, A.E., Kennard, R.W. and Baldwin, K.F. (1975) Ridge Regression: Some Simulations. Communications in Statistics, 4, 105-123.
https://doi.org/10.1080/03610917508548342
[7] Lawless, J.F. and Wang, P. (1976) A Simulation Study of Ridge and Other Regression Estimators. Communications in Statistics A, 5, 307-323.
https://doi.org/10.1080/03610927608827353
[8] Nomura, M. (1988) On the Almost Unbiased Ridge Regression Estimation. Communications in Statistics—Simulation and Computation, 17, 729-743.
https://doi.org/10.1080/03610918808812690
[9] Khalaf, G. and Shukur, G. (2005) Choosing Ridge Regression Parameters for Regression Problems. Communications in Statistics—Simulation and Computations, 32, 419-435.
[10] Dorugade, A. (2014) New Ridge Parameters for Ridge Regression. Journal of the Association of Arab Universities for Basic and Applied Sciences, 15, 94-99.
https://doi.org/10.1016/j.jaubas.2013.03.005
[11] Al-Hassan, Y.M. (2010) Performance of a New Ridge Regression Estimator. Journal of the Association of Arab Universities for Basic and Applied Science, 9, 23-26.
https://doi.org/10.1016/j.jaubas.2010.12.006
[12] Dorugade, A.V. and Kashid, D.N. (2010) Alternative Methods for Choosing Ridge Parameter for Regression. Applied Mathematical Science, 4, 447-456.
[13] Saleh, A.K.Md. and Kibria, B.M.G. (1993) Performances of Some New Preliminary Test Ridge Regression Estimators and Their Properties. Communications in Statistics—Theory and Methods, 22, 2747-2764.
https://doi.org/10.1080/03610929308831183
[14] Kibria, B.M.G. (2003) Performance of Some New Ridge Regression Estimators. Communications in Statistics—Simulation and Computation, 32, 417-435.
https://doi.org/10.1081/SAC-120017499
[15] Zang, J. and Ibrahim, M. (2005) A Simulation Study on SPSS Ridge Regression and Ordinary Least Square Regression Procedures for Multicollinearity Data. Journal of Applied Statistics, 32, 571-588.
https://doi.org/10.1080/02664760500078946
[16] Alkhamisi, M., Khalaf, S. and Shukur, G. (2006) Some Modifications for Choosing Ridge Parameters. Communications in Statistics—Theory and Methods, 37, 544-564.
https://doi.org/10.1080/03610920701469152
[17] Al-Hassan, Y.M. (2008) A Monte Carlo Evaluation of Some Ridge Estimators. Japan Journal of Applied Science: Natural Science Series, 10, 101-110.
[18] Muniz, G. and Kibria, B.M.G. (2009) On Some Ridge Regression Estimators: An Empirical Comparison. Communications in Statistics—Simulations and Computation, 38, 621-630.
https://doi.org/10.1080/03610910802592838
[19] Khalaf, G. and Iguernane, M. (2014) Ridge Regression and Ill-Conditioning. Journal of Modern Applied Statistical Methods, 13, 355-363.
https://doi.org/10.22237/jmasm/1414815420
[20] Uzuke, C.A., Mbegbu, J.I. and Nwosu, C.R. (2017) Performance of Kibria, Khalaf and Shukur’s Methods When the Eigenvalues Are Skewed. Communications in Statistics—Simulation and Computation, 46, 2071-2102.
https://doi.org/10.1080/03610918.2015.1035444
[21] Cook, R. (1977) Detection of Influential Observations in Linear Regression. Technometrics, 19, 15-18.
https://doi.org/10.1080/00401706.1977.10489493
[22] Welsch, R. and Kuh, E. (1977) Linear Regression Diagnostics. Working Paper 923-77, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA.
https://doi.org/10.3386/w0173
[23] Hadi, A. (1992) A New Measure of Overall Potential Influence in Linear Regression. Computational Statistics and Data Analysis, 14, 1-27.
https://doi.org/10.1016/0167-9473(92)90078-T