Longitudinal surveys are sample surveys conducted repeatedly over time on the same sampled units. Such surveys yield data that are rich in information about each sampled unit and are therefore suitable for a variety of purposes. Although longitudinal surveys are regarded as more reliable in describing the features of a study unit, they suffer from monotone and intermittent patterns of missing data. Nonresponse of this kind often arises because respondents become inaccessible or deliberately refuse to provide information after having participated in earlier waves of the survey.
Missing data are a problem because nearly all standard statistical methods presume complete information for all the variables included in the analysis. Analysing data with missing values reduces the effective sample size, which widens confidence intervals, lowers statistical power and can bias estimates of population parameters. Imputation is one of the approaches used to fill in these missing values. Over time, various imputation models have been developed and used to overcome many of the challenges caused by missing data. However, some shortcomings remain, such as bias and inefficiency of the resulting estimators. This is because imputation models rest on different assumptions in both parametric and nonparametric contexts.
Parametric methods such as maximum likelihood estimation are sensitive to model misspecification, while nonparametric methods are more robust and flexible. Methods used in earlier work include simple linear regression imputation and the Nadaraya-Watson technique. Simulation results from that work showed that simple linear regression imputation produces biased estimates even when the responses at a particular time (including previous values) are correctly specified. The Nadaraya-Watson technique, when used to impute missing values in longitudinal data, in turn produces a large design bias and boundary effects that give unreliable estimates for inference.
A rival to the Nadaraya-Watson technique is the local linear regression estimator, which earlier work found to produce unbiased estimates without boundary effects. One study of the weighted Nadaraya-Watson method examined its consistency, asymptotic normality and behaviour at interior and boundary points, and found that local linear regression performs much better than the weighted Nadaraya-Watson method because it produces asymptotically unbiased estimates without boundary effects. Other work likewise found that the local linear regression estimator has desirable properties.
To overcome the limitations of the Nadaraya-Watson estimator, we derive a local linear regression estimator for imputing the values of nonrespondents in a longitudinal data set. The asymptotic properties (unbiasedness and consistency) of the proposed estimator are investigated. The estimators (parametric and nonparametric) are compared on the basis of the bootstrap standard deviation, mean square error and percentage relative bias. A simulation study is conducted to determine the best performing estimator of the finite population mean.
2. Assumptions and Notations
1) All sampled units are observed at the first time point and remain in the sample until the final time point. The variable of interest is the value of y for the unit at time point t.
2) The prediction process is past last value dependent and the vectors
are independently and identically distributed (i.i.d) from the superpopulation under the model-assisted approach.
For and and the response indicator function is
3) The vector follows the Markov chain for longitudinal survey data without missing values
4) We assume that the population P is divided into a fixed number of imputation classes, which are basically unions of some small strata.
3. Regularity Conditions
Denote f to be a probability density function (pdf) of X and where is defined by;
and g and f have bounded second derivatives
i) The Kernel function K is a bounded and twice continuously differentiable symmetric function on the interval, and such that,
, , and.
ii) The regression function is at least twice continuously differentiable everywhere in the neighborhood of.
iii) The sample survey variable of interest has a finite second moment bounded on the interval. Thus.
iv) The conditional variance is bounded and continuous.
4.1. Imputation Process
Considering the case of the last past value, we impute each missing value with the value obtained through the prediction procedure. The joint distribution of the bivariate random variables (X, Y) is preserved when the missing value Y is imputed from the conditional distribution of Y given X. Therefore, considering the conditional mean imputation approach for single imputation, let
be the conditional expectation with respect to the superpopulation for unobserved value with observed value for.
It is therefore clear that when is known, then the imputed value of unobserved is given by. In cases where in Equation (4) is unknown, for nonmonotone nonrespondents, we employ the last value dependent mechanism.
Under assumption (2), we have
Using Equation (4), we are limited to do estimation by regressing the nonrespondents on the observed values based on the longitudinal survey data, therefore, we apply the equivalent Equation (5) which allows estimation using data from all subjects having observed and observed. Then, the imputation of the nonrespondents is done using in Equation (5) and under the last value dependent assumption, we are able to use auxiliary survey data in regression fitting. According to  , imputing nonresponses using (5) was done for monotone case and their approach is easy to apply if the conditional expectation say, in (4) has a linear relationship with x. Adopting the concept of nonparametric method in  , here, the local linear estimator of is. Let be the variable of interest for the i-th unit at time t where and. Associated with each are the known, , of q auxiliary variables. To make the notations and writings simple, we relax the index t and write with a single subscript i, thus is written as.
The regression imputation model is given by the relation
where the residuals are assumed to be independently normally distributed with mean zero and variance.
It is clear that
where is an unknown regression function which is a smooth function of x.
To obtain the estimator of at and its derivatives, we use weighted local polynomial fitting, assuming that the derivatives of the regression function at a point, say, exist and are continuous.
We can rewrite the imputation model (6) as
where approximation of about is done following the Taylor series expansion.
The kernel weight given as
where h is the bandwidth and K is the kernel function which should be strictly positive and controls the weights, is the point of focus and being the covariate with design matrix centered at past last value and j is the order of the local polynomial.
Equation (12) is the Nadaraya-Watson estimator.
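The local constant (Nadaraya-Watson) fit referred to above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes equal survey weights, a Gaussian kernel and a hand-picked bandwidth, and the names `gaussian_kernel` and `nadaraya_watson` are hypothetical. The toy data are exactly linear, which makes the boundary effect mentioned in the introduction visible.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def nadaraya_watson(x0, xs, ys, h):
    """Local constant estimate of E[Y | X = x0] with bandwidth h > 0.

    xs, ys are paired values from units observed at both time points.
    Survey weights are taken as equal in this sketch.
    """
    weights = [gaussian_kernel((x - x0) / h) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; increase h")
    return sum(w * y for w, y in zip(weights, ys)) / total

# Toy data on an exact line y = 2x + 1 (illustrative values only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

interior = nadaraya_watson(2.0, xs, ys, h=1.0)  # symmetric neighbours
boundary = nadaraya_watson(0.0, xs, ys, h=1.0)  # neighbours on one side only
```

At the interior point the symmetric weights recover the true value 5, whereas at the left boundary the estimate is pulled well above the true value 1 because every neighbour lies to the right.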
With estimator the, the conditional expectation given by is used to impute the missing values, i.e.
where is the survey weight and
Minimizing S with respect to and in Equation (15) and solving for and, we get
Using, in Equation (17), we obtain
and with, in Equation (17), it yields
Similarly, using, in Equation (16) gives
and with, Equation (16) becomes
The local linear estimator for the regression function is now given by:
Substituting for (from Equation (20)) and (from Equation (18)) in Equation (22) gives,
With estimator, , the conditional expectation given by is used to impute the missing values, i.e.
where, is the weight according to the survey design and is as defined earlier.
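As a rough sketch of the local linear fit and its use for last-value-dependent imputation, the weighted least squares problem can be solved in closed form from five kernel-weighted sums. The code assumes equal survey weights and a Gaussian kernel; `local_linear` and `impute_from_last_value` are hypothetical names, and the data are illustrative.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def local_linear(x0, xs, ys, h):
    """Local linear estimate of E[Y | X = x0]: the intercept of a
    kernel-weighted least squares line centred at x0."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        w = gaussian_kernel(d / h)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    denom = s0 * s2 - s1 * s1
    if abs(denom) < 1e-12:
        raise ValueError("degenerate local design; increase h")
    return (s2 * t0 - s1 * t1) / denom

def impute_from_last_value(prev, curr, h):
    """Fill missing entries (None) of curr using the fit of curr on prev
    among units observed at both time points."""
    pairs = [(x, y) for x, y in zip(prev, curr) if y is not None]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    return [y if y is not None else local_linear(x, xs, ys, h)
            for x, y in zip(prev, curr)]

prev = [0.0, 1.0, 2.0, 3.0, 4.0]
curr = [1.0, 3.0, None, 7.0, 9.0]   # third unit is a nonrespondent
completed = impute_from_last_value(prev, curr, h=1.0)
edge_fit = local_linear(0.0, prev[:2] + prev[3:], [1.0, 3.0, 7.0, 9.0], h=1.0)
```

Because a local line reproduces exactly linear data, the imputed value is 5 and the fit at the left edge returns the true value 1, in contrast to the local constant fit.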
4.2. Estimation of the Finite Population Means Using the Imputed Data
In this study, we consider a finite population from which samples are drawn. Imputation is carried out before the population parameters are estimated. Suppose that the survey measurements are on the variables respectively and a simple random sample without replacement, , of size n is selected from a finite population, P of size N. The sample consists of two parts: and, where is the set of all respondents in the survey and is the set of all non-respondents. The missing observations of the sample unit, for are considered. Imputation of the missing value for and is done, producing a complete data set which is then used in the estimation of finite population means.
Let be the finite population mean at time point, t for
The value to be imputed for the non respondent is denoted by such that the imputed data is given as
The mean of the finite population is given by
Now, using the imputed data, the estimator of the finite population total is the sample total of the imputed data denoted by and is given by
Thus, using the imputed data, the estimator of the finite population mean is the sample mean of the imputed data denoted by, given by
Assuming that for each
The imputed values are treated as if they were observed such that both observed and the imputed are used in the estimation of the population mean:
Sample mean for the imputed data becomes
Note that the same weight due to sampling design is used in Equation (29) for all units in the sample.
Since t is used as a constant variable, Equation (30) is re-written as
As for  , the local constant estimation for the nonrespondents in Equation (31) is obtained as:
and the local linear estimation for the nonrespondents, in Equation (31) is given by:
Clearly, in Equation (31) is substituted by Equation (32) and Equation (33) for use of local constant estimator and local linear regression estimator respectively.
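The estimator built from the imputed data amounts to averaging observed values for respondents and imputed values for nonrespondents. A minimal sketch follows, assuming simple random sampling so that every unit carries the same weight; the function name and the numbers are hypothetical.

```python
def imputed_sample_mean(observed, responded, imputed):
    """Mean of the completed data at one time point.

    observed:  value for each sampled unit (ignored where responded is False)
    responded: True for respondents, False for nonrespondents
    imputed:   imputed value for each unit (used only for nonrespondents)
    Equal sampling weights are assumed, as under simple random sampling.
    """
    n = len(observed)
    if n == 0:
        raise ValueError("empty sample")
    total = 0.0
    for y, r, y_star in zip(observed, responded, imputed):
        total += y if r else y_star
    return total / n

observed = [2.0, 4.0, None, 8.0]
responded = [True, True, False, True]
imputed = [None, None, 6.0, None]   # e.g. from a local linear fit
mean_hat = imputed_sample_mean(observed, responded, imputed)
```

Swapping the local constant imputed values for the local linear ones changes only the `imputed` argument, which mirrors how the two estimators are substituted into the same expression.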
5. Asymptotic Properties of the Estimator
In the derivation of the asymptotic properties, we use the set of regularity conditions. According to  , the asymptotic theory development is provided by the concept of a sequence of finite populations with strata in. It is assumed that there is a sequence of finite populations and the corresponding sequence of samples. The finite population P indexed by is assumed to be a member of the sequence of the populations. The sample size denoted by and the population size denoted by approach infinity as. The uniform response and the size of the nonrespondents set satisfy the condition. All limiting processes will be understood as such that the regularity conditions are satisfied. For ease of notation, the subscript will be ignored in the subsequent work.
Theorem 1. Assume that the regularity conditions (i)-(iv) and the assumptions in Section 2 hold. Then, under the regression imputation model (6), the estimator, in Equation (31), is asymptotically unbiased and consistent for the population mean.
Proof. 1) Bias of.
The general formula for the finite population total is given by:
where and are the sampled and the non sampled sets respectively.
Equation (34) can be decomposed as
For simplicity, denote, and by, and respectively throughout the remaining work.
From Equation (31), the estimator for the finite population total is given by
Now consider the difference,
Taking expectation on both sides of Equation (38), we have
Assuming such that, then in Equation (41) and hence,
But from Lemma 1 (see Appendix),
Thus the bias of becomes
2) Variance of.
The variance of is given by the variance of the error term. That is,
for sufficiently large n such that and; where
3) Mean square error (MSE) of.
Finally, we have
which is the asymptotic expression of the MSE for. as and, and thus is consistent.
Consequently, is asymptotically unbiased and consistent.
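The step from a vanishing MSE to consistency rests on two standard facts, written here in generic notation that is an assumption of this note rather than the paper's own symbols: an estimator $\hat{\theta}$ of a target $\theta$ treated as fixed. The first three lines are the usual bias-variance decomposition, and the last is Chebyshev's (Markov's) inequality applied to the squared error.

```latex
\begin{align*}
\operatorname{MSE}(\hat{\theta})
  &= E\big[(\hat{\theta}-\theta)^2\big] \\
  &= E\big[(\hat{\theta}-E\hat{\theta})^2\big]
     + \big(E\hat{\theta}-\theta\big)^2 \\
  &= \operatorname{Var}(\hat{\theta})
     + \big[\operatorname{Bias}(\hat{\theta})\big]^2, \\
P\big(|\hat{\theta}-\theta| > \varepsilon\big)
  &\le \frac{\operatorname{MSE}(\hat{\theta})}{\varepsilon^{2}}
  \qquad \text{for every } \varepsilon > 0 .
\end{align*}
```

Hence, if both the bias and the variance tend to zero, the MSE tends to zero and the estimator converges in probability to its target.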
6. Simulation Study
6.1. Description of Longitudinal Data
In this section, the finite population mean estimators are studied using three measures of performance: percentage relative bias (%RB), MSE and bootstrap standard deviation (SD bootstrap).
Simulations and computations of the finite population mean estimators were done in R (version 3.2.3, 2015-12-10) based on 1000 runs. For the local linear and local constant estimators, the Gaussian kernel with a fixed bandwidth of was used. To fit the nonparametric regression, the loess function in R was used.
For comparison purposes, we used the complete data as our main reference in evaluating the performance of the estimators (the proposed local linear estimator, the local constant estimator and the simple linear regression estimator).
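The two bias and accuracy measures can be computed across simulation runs as sketched below. The formulas are the commonly used definitions and are an assumption here, since the exact expressions are not reproduced in the text; the function names and numbers are hypothetical.

```python
def percent_relative_bias(estimates, true_value):
    """%RB: 100 * (average estimate - true value) / true value."""
    if true_value == 0:
        raise ValueError("relative bias is undefined for a zero target")
    mean_estimate = sum(estimates) / len(estimates)
    return 100.0 * (mean_estimate - true_value) / true_value

def mean_square_error(estimates, true_value):
    """MSE: average squared deviation of the estimates from the true value."""
    return sum((e - true_value) ** 2 for e in estimates) / len(estimates)

# Three illustrative simulation runs estimating a population mean of 2.0.
runs = [1.9, 2.1, 2.3]
rb = percent_relative_bias(runs, 2.0)
mse = mean_square_error(runs, 2.0)
```

In a full study, `runs` would hold one estimate per simulation run for a given estimator and time point, and `true_value` would be the corresponding finite population mean.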
In this simulation study, a sample of size was considered. The longitudinal data for each of the sampled units is of size that is,. This will yield $2^3 = 8$ different patterns of the longitudinal data, with respondent and nonrespondent values denoted by 1 and 0 respectively at the different time points.
Longitudinal data was generated according to two models:
1) In model 1, simulation of is done from a multivariate normal distribution with the means for the 4 time points as 1.33, 1.94, 2.73, 3.67 respectively and the covariance matrix following the model with standard error 1 and correlation coefficient 0.9.
2) In model 2, simulation of is done from a multivariate normal distribution with the means for the 4 time points as 1.33, 1.94, 2.73, 3.67 respectively and the covariance matrix following the model with standard error 1 and correlation coefficient 0.9.
In order to obtain the nonmonotone pattern in the simulated data, we used the predetermined unconditional probabilities of  shown in Table 1.
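A data-generating step of this kind might look like the sketch below. Several choices are assumptions made only for illustration: the covariance is taken to be first-order autoregressive with unit variance, the pattern probabilities are placeholders rather than the values of Table 1, and 200 units are generated. The means and the correlation 0.9 come from the text.

```python
import itertools
import math
import random

MEANS = [1.33, 1.94, 2.73, 3.67]  # means at the 4 time points (from the text)
RHO = 0.9                         # correlation coefficient (from the text)

# Response patterns with the first time point always observed: 2^3 = 8.
PATTERNS = [(1,) + tail for tail in itertools.product((1, 0), repeat=3)]
# HYPOTHETICAL probabilities, not the Table 1 values.
PATTERN_PROBS = [0.30, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]

def simulate_unit(rng):
    """One unit's values at 4 time points; ASSUMES an AR(1) covariance
    with unit variance, so corr(y_s, y_t) = RHO ** |s - t|."""
    z = rng.gauss(0.0, 1.0)
    values = [MEANS[0] + z]
    for mu in MEANS[1:]:
        z = RHO * z + math.sqrt(1.0 - RHO ** 2) * rng.gauss(0.0, 1.0)
        values.append(mu + z)
    return values

def apply_nonresponse(values, rng):
    """Blank out (None) the time points flagged 0 in a random pattern."""
    pattern = rng.choices(PATTERNS, weights=PATTERN_PROBS, k=1)[0]
    return [y if r == 1 else None for y, r in zip(values, pattern)]

rng = random.Random(2015)
data = [apply_nonresponse(simulate_unit(rng), rng) for _ in range(200)]
```

Because patterns such as (1, 0, 1, 1) are allowed, a unit can return after a missed wave, which is what makes the nonresponse nonmonotone.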
6.2. Bootstrap Variance Estimation
The following steps were used to obtain the bootstrap variance.
1) We constructed a pseudo population by replicating the sample of size 1500 times through 1000 simulation runs.
Table 1. Probabilities of nonresponse patterns for.
2) A simple random sample of size 200 was drawn with replacement from the pseudo population.
3) We applied the simple linear regression, local constant and local linear regression imputation models to impute the missing values in the sample.
4) Steps 2 and 3 were repeated a large number of times () to obtain where is the analog of, for the b-th bootstrap sample.
5) The bootstrap variance of was obtained by the formula
where is the mean bootstrap analog of, given by.
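The bootstrap steps above can be sketched as follows. The divisor B - 1 in the variance is an assumption (the formula itself is not reproduced here), `bootstrap_variance` is a hypothetical name, and a plain sample mean on complete toy data stands in for the impute-then-average estimators.

```python
import random
import statistics

def bootstrap_variance(sample, estimator, n_copies, n_draw, n_boot, rng):
    """Bootstrap variance of `estimator`.

    Step 1: build a pseudo population from n_copies copies of the sample.
    Step 2: draw n_draw units from it with replacement.
    Step 3: apply the estimator (imputation followed by averaging).
    Step 4: repeat n_boot times.
    Step 5: return the variance of the replicates (divisor n_boot - 1).
    """
    pseudo_population = list(sample) * n_copies
    replicates = []
    for _ in range(n_boot):
        resample = rng.choices(pseudo_population, k=n_draw)
        replicates.append(estimator(resample))
    return statistics.variance(replicates)

# Toy complete data; statistics.fmean is a stand-in estimator.
toy_sample = [float(v) for v in range(10)]
v_boot = bootstrap_variance(toy_sample, statistics.fmean,
                            n_copies=3, n_draw=200, n_boot=2000,
                            rng=random.Random(1))
sd_boot = v_boot ** 0.5
```

The bootstrap standard deviation reported in a study of this kind would be the square root of this variance, computed separately for each estimator and time point.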
6.3. Results and Discussion
For the normal data (Table 2), in terms of the percentage relative bias (%RB), at time point, the local linear estimator has the least value, followed by the Nadaraya-Watson estimator and then the simple linear regression estimator, which has the largest %RB.
At time point, the simple linear regression estimator has the least %RB value, followed by the local linear estimator, and the Nadaraya-Watson estimator performed worst with the largest %RB. The %RB values of the local linear estimator and the simple linear regression estimator are much closer to zero than that of the Nadaraya-Watson estimator.
At time point, the local linear estimator has the least %RB value, followed by the simple linear regression estimator, and the Nadaraya-Watson estimator performed worst. Compared against the complete data, the %RB values of the local linear estimator approach zero.
In terms of MSE, at time point, Nadaraya-Watson estimator has the least values followed by the local linear estimator and lastly the simple linear regression estimator which has the largest values. At time point, the local linear estimator has the least values of MSE followed by the simple linear regression estimator and lastly the Nadaraya-Watson estimator which has the largest MSE value. At time points, Nadaraya-Watson estimator has the least values of MSE followed by the simple linear regression estimator and lastly the local linear estimator which has the largest MSE value.
Table 2. Simulated results for mean estimation (normal case).
Table 3. Simulated results for mean estimation (log-normal case).
In terms of the bootstrap standard deviation, the local linear estimator performs best at all three time points, , and, where its values are even lower than those of the complete data, implying that the local linear estimator gives the best results. The simple linear regression and Nadaraya-Watson estimators alternate in rank across the time points.
For the log-normal data (Table 3), in terms of the percentage relative bias (%RB), at time points and, the simple linear regression estimator has the least %RB values, followed by the local linear estimator, and the Nadaraya-Watson estimator has the largest %RB values. On the basis of these results, the local linear estimator is a viable choice of best estimator because it handles both linear and nonlinear models. At time points, the local linear estimator has the least %RB value, followed by the simple linear regression estimator and lastly the Nadaraya-Watson estimator. This implies that, for, the local linear estimator has the smallest bias, close to zero as for the complete data, and is hence the best estimator compared to the others.
In terms of the MSE, at time points and, the Nadaraya-Watson estimator has the least values of MSE, followed by the simple linear regression estimator and lastly the local linear estimator, which has the largest values of MSE. At time point, the local linear estimator has the least values, implying that it performed well at time point.
In terms of the bootstrap standard deviation, observe from Table 3 that the local linear estimator performs best at all three time points, since it has the least bootstrap standard deviations, and these values are even smaller than those of the complete data in order of increasing time.
From Table 3, it can be seen that the bootstrap standard deviations of the local linear estimator are closer to those of the Nadaraya-Watson estimator than to those of the simple linear regression estimator.
Generally, nonresponse in any survey data has a significant impact on the bias and the variance of the estimators; therefore, before such data are used in statistical inference, imputation with an appropriate technique ought to be done. In this study, the main objective was to obtain an imputation method based on local linear regression for nonmonotone nonrespondents in longitudinal surveys and to determine its asymptotic properties. Overall, the nonparametric methods performed better than the parametric method. This was evident from the MSE and %RB values in both the normal and log-normal data. Among the nonparametric methods, the local linear estimator was the best estimator, as it behaved better than the Nadaraya-Watson estimator in terms of %RB. In terms of the bootstrap standard deviation, the local linear estimator performs best at all three time points, since it has the least bootstrap standard deviations for the two data sets. Generally, the local linear estimator performs relatively well, in particular for the normal data. We conclude that the use of nonparametric estimators seems plausible in both theoretical and practical scenarios.
We wish to thank the African Union Commission for fully funding this research.
LEMMA 1. The bias of is given by
Under the regularity conditions in section 3, as and
PROOF OF LEMMA 1.
Proof. From Equation (23),
The expectation of is given by
The bias of is therefore given by
For fixed design points of’s on the interval, the expression
almost everywhere, see  .
Equation (56) becomes
Hence, the bias of can be re-written as
and hence the result.
LEMMA 2. The asymptotic expression of the variance of is given by
as and; where.
PROOF OF LEMMA 2.
Proof. Using Equation (23),
It follows that
The asymptotic expression of the variance of becomes
where. Hence the result.
From LEMMA 1 and 2, the MSE of becomes