The main objective of a sample survey is to obtain information about a population and then use that information to make inferences about population quantities. The information most often sought is aggregate values of population characteristics, such as totals, the number of units, or the proportion of units having certain attributes. It can be collected either by sampling or by a census. One approach to using auxiliary information in the construction of estimators is to assume a working model that describes the relationship between the survey variable and the auxiliary variable. Estimators are then derived from this model, and they are expected to have good efficiency when the model is true. In most cases, a linear model is assumed. Generalized regression estimators by  and  , including linear regression estimators and ratio estimators by  , best linear unbiased estimators by  and  , and post-stratification estimators by  are all derived under the assumption of a linear model. Sometimes the linear model fails, and the resulting estimators then do not outperform purely design-based estimators. As a result,  proposed a class of estimators in which the working model is a nonlinear parametric model. Improving the efficiency of such estimators, however, requires prior information about the exact parametric structure of the population. Because of these concerns, several researchers have considered nonparametric working models instead. Nonparametric regression may be used in the estimation of unknown finite population quantities such as population totals, means or proportions. The idea of nonparametric regression traces its origin to works by  and  . Nonparametric estimation is often more robust and flexible than inference based on parametric regression models or design probabilities (as in design-based inference)  .
In sample surveys, auxiliary information is used at the estimation stage to increase the precision of estimators of finite population quantities, such as the population total or mean    .
A variety of approaches exist for constructing more efficient estimators of the population total or mean, including model-based and design-based methods. The model-based approach in sample surveys rests on superpopulation models, which assume that the population under study is a realization of a random variable following a superpopulation model. This model is used to predict the nonsampled values of the population, and hence finite population quantities such as the total or mean  .  first considered nonparametric models for  within a model-assisted approach and obtained a local polynomial regression estimator as a generalization of the ordinary generalized regression estimator. Their simulation study shows that the proposed estimator performs relatively better than parametric estimators.  improved on  estimator and developed a model-based local polynomial regression estimator applicable to direct sampling designs such as simple random sampling and systematic sampling. Their estimator performs better than  model-assisted estimator and also outperforms parametric estimators.
In this paper, auxiliary information is used to construct an estimator of the finite population total using nonparametric regression under stratified random sampling. To achieve this, a model-based approach is adopted, in which local polynomial regression is used to predict the nonsampled values of the survey variable y. Stratified estimators of the finite population total or mean have been shown to perform better than those resulting from simple random sampling   . Additionally, it has been shown in the literature that the local polynomial approximation method has several attractive features, including satisfactory boundary behaviour, easy interpretability, applicability to a variety of design circumstances and good minimax properties (see   and  ).
2. Proposed Estimator
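Consider a population consisting of N units. Suppose this population is divided into H disjoint strata, each of size $N_h$, $h = 1, 2, \dots, H$, so that $\sum_{h=1}^{H} N_h = N$.
Let $y_{hj}$ be the survey measurement for the $j$th unit in the $h$th stratum. Further, let $x_{hj}$ be the auxiliary measurement positively correlated with $y_{hj}$.
From each stratum, a simple random sample of size $n_h$ is selected without replacement, where $n_h$ is sufficiently large with respect to $N_h$ and $n = \sum_{h=1}^{H} n_h$.
Let $s_h$ be the sample in the $h$th stratum and $r_h$ be the nonsampled set in the $h$th stratum.
The population total is defined as
$$Y = \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj},$$
which can be rewritten as
$$Y = \sum_{h=1}^{H} \sum_{j \in s_h} y_{hj} + \sum_{h=1}^{H} \sum_{j \in r_h} y_{hj}. \qquad (1)$$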
Once the sample has been observed, the problem of estimating Y becomes the problem of predicting the sum of the nonsampled values. Usually, inference is made using the known sample and the model.
The first component in Equation (1) is known, while the second requires prediction, which is the focus of this paper. Local polynomial regression will be used to predict the unknown nonsampled values.
Suppose the distribution generating is given by the superpopulation model, in which
where are independently distributed random variables with mean 0 and variance.
Then it follows that
where and are assumed to be continuous and twice differentiable functions of x, and.
In practice, the values of are unknown and so require prediction. Adopting   and  ideas, we use local polynomial regression of degree p, which is a generalization of kernel smoothing, to predict the unobserved in Equation (1). Let, where K denotes a continuous kernel function and b is the bandwidth.
Then a model-based local polynomial regression estimator of the nonsampled in the stratum is given by:
where is a column vector of length;;
and. Equation (6) holds as long as is a nonsingular matrix.
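For orientation, local polynomial predictors of this kind are commonly written in weighted least squares form. The display below is a sketch in assumed notation: the symbols $\mathbf{X}_{hj}$, $\mathbf{W}_{hj}$, $\mathbf{y}_{s_h}$ and the kernel scaling $K_b(u) = b^{-1}K(u/b)$ follow common conventions in the local polynomial literature and need not match the authors' own. For a nonsampled unit $j$ in stratum $h$ with auxiliary value $x_{hj}$, and sampled units $i$ in the same stratum,

$$\hat{m}(x_{hj}) = \mathbf{e}_1^{T}\left(\mathbf{X}_{hj}^{T}\mathbf{W}_{hj}\mathbf{X}_{hj}\right)^{-1}\mathbf{X}_{hj}^{T}\mathbf{W}_{hj}\,\mathbf{y}_{s_h},$$

where $\mathbf{e}_1 = (1, 0, \dots, 0)^{T}$ has length $p + 1$, the row of $\mathbf{X}_{hj}$ for sampled unit $i$ is $[1, (x_{hi} - x_{hj}), \dots, (x_{hi} - x_{hj})^{p}]$, $\mathbf{W}_{hj} = \mathrm{diag}\{K_b(x_{hi} - x_{hj})\}$, and $\mathbf{y}_{s_h}$ stacks the sampled survey values. In this form, $\mathbf{X}_{hj}^{T}\mathbf{W}_{hj}\mathbf{X}_{hj}$ is nonsingular only if at least $p + 1$ sampled units with distinct auxiliary values receive positive kernel weight, which is why very small bandwidths can make the predictor undefined.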
Now denote the estimator for the finite population total by and the estimator within the stratum by. Then, in stratum h, the estimator of the population total based on local polynomial regression is
and the estimator for the finite population total is
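The prediction form of the estimator (observed sample values plus local polynomial predictions for the nonsampled units, summed over strata) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation, which was written in R: the Epanechnikov kernel, the scaling $K(u/b)/b$, and the function names `epanechnikov`, `local_poly_predict` and `stratified_total` are all hypothetical choices made here.

```python
import numpy as np


def epanechnikov(u):
    """Epanechnikov kernel (an assumed choice; the paper's kernel is not shown)."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)


def local_poly_predict(x_s, y_s, x0, b, p=1):
    """Local polynomial fit of degree p at x0 by weighted least squares.

    x_s, y_s : auxiliary and survey values of the sampled units in one stratum
    x0       : auxiliary value of a nonsampled unit
    b        : bandwidth
    """
    d = x_s - x0
    X = np.vander(d, N=p + 1, increasing=True)  # columns 1, d, ..., d^p
    w = epanechnikov(d / b) / b                 # kernel weights K_b(d)
    XtW = X.T * w                               # X' W without forming diag(w)
    # np.linalg.solve raises LinAlgError if X'WX is exactly singular,
    # e.g. when fewer than p + 1 sampled points fall inside the window.
    beta = np.linalg.solve(XtW @ X, XtW @ y_s)
    return beta[0]                              # intercept = fitted value at x0


def stratified_total(strata, b, p=1):
    """Sum over strata of observed sample totals plus predicted nonsampled values.

    strata : iterable of (x, y, sampled) per stratum, where `sampled` is a
             boolean mask; y is only read where sampled is True.
    """
    total = 0.0
    for x, y, sampled in strata:
        s = np.asarray(sampled, dtype=bool)
        preds = [local_poly_predict(x[s], y[s], x0, b, p) for x0 in x[~s]]
        total += y[s].sum() + float(np.sum(preds))
    return total
```

With `p=1` this is a local linear fit, and with `p=0` it reduces to a kernel-weighted average of the sampled values. The bandwidth `b` has to be wide enough that every nonsampled point has at least `p + 1` sampled neighbours inside the kernel window.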
3. Properties of Proposed Estimator
In this section, a study is carried out on various properties of estimator (8), which may be important in practice. In doing so, the following assumptions are made:
1) The regression function has a bounded second derivative.
2) The marginal density, is continuous and.
3) The conditional variance is bounded and continuous.
4) The kernel density function is bounded and continuous satisfying the
following:, , and
These conditions on were imposed and used in  work; they are adopted purely for the convenience of the technical arguments and can therefore be relaxed.
3.1. Is Asymptotically Model-Unbiased
Now consider the difference:
and taking expectation yields
which is the bias associated with.
Approximating by Taylor series expansion about a point and assuming further that and, then observe that
and applying expectations then
Theorem 3 of  allows that under conditions (1)-(4) if and,
It implies that provided that and, and thus is asymptotically model-unbiased.
3.2. Mean Square Error (MSE) of
The estimator (8) has the MSE
which can be decomposed as
Theorem 1 of  allows that under Condition (1), if then
Observe that Equation (24) tends to zero if and and thus
This shows that is statistically consistent and thus useful.
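For reference, the decomposition used in this kind of argument is the standard split of a prediction mean square error into a variance term and a squared bias term. Writing $\hat{Y}$ for a generic estimator of the total $Y$ (a placeholder symbol assumed here, not the authors' notation) and taking expectations under the model,

$$\mathrm{MSE}(\hat{Y}) = E\left[(\hat{Y} - Y)^2\right] = \mathrm{Var}(\hat{Y} - Y) + \left[E(\hat{Y} - Y)\right]^2 .$$

Consistency then follows whenever both the variance of the prediction error and the squared bias vanish asymptotically.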
4. Simulation Study
Four estimators are compared in this simulation study (see Table 1). The first estimator is design-based, the second is parametric and model-based, while the last two are nonparametric and model-based.
4.1. Description of the Population
The working model is taken to be,. In this study, four populations are considered, which are generated from the regression model given by
with the following mean functions
with. They represent a class of correct and incorrect model specifications for the estimators being considered. For, is expected to be the best estimator, since the model assumed is correctly specified. The rest of the mean functions:, and represent various deviations from the linear model,. These populations are plotted in Figure 1. For more on these populations, see  and  .
Table 1. Estimators being compared in the simulation study.
Figure 1. Plot of linear, sine, bump and jump populations.
The errors are assumed to be independent and identically distributed (i.i.d.) normal random variables having mean 0 and standard deviation. Each population contains 2000 units, and the population is simulated as i.i.d. uniform random variables. The population values are generated from the mean functions by adding the errors in each of the cases. Each of the populations is divided into 10 equal-sized, disjoint strata, which are made as homogeneous as possible to ensure that units in each stratum vary little from each other. A sample of size, is then taken with each stratum contributing a sample size of,. For each case, 1000 samples are simulated using simple random sampling without replacement.
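The population construction and stratified sampling described above can be sketched as follows. This is an illustration under loudly stated assumptions rather than the study's actual code (which was written in R): the four mean functions are hypothetical stand-ins for linear, sine, bump and jump shapes, the uniform range $[0, 1]$, the error standard deviation `sigma=0.1` and the per-stratum sample size are placeholders, and homogeneous strata are formed by sorting on the auxiliary variable and cutting into equal blocks.

```python
import numpy as np

# Hypothetical stand-ins for the four mean functions; not the paper's own.
MEAN_FUNCTIONS = {
    "linear": lambda x: 1.0 + 2.0 * (x - 0.5),
    "sine": lambda x: 2.0 + np.sin(2.0 * np.pi * x),
    "bump": lambda x: 1.0 + 2.0 * (x - 0.5) + np.exp(-200.0 * (x - 0.5) ** 2),
    "jump": lambda x: np.where(x <= 0.65, 1.0 + 2.0 * (x - 0.5), 0.65),
}


def make_population(mean_fn, N=2000, sigma=0.1, rng=None):
    """Generate (x, y) with x ~ U(0, 1) sorted and y = mean_fn(x) + normal error.

    sigma is an assumed value; the paper's standard deviation is not shown.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.sort(rng.uniform(0.0, 1.0, N))
    y = mean_fn(x) + rng.normal(0.0, sigma, N)
    return x, y


def stratify(N, H=10):
    """Index arrays for H equal-sized strata of units already sorted by x."""
    return np.array_split(np.arange(N), H)


def draw_stratified_sample(strata, n_h, rng):
    """Simple random sample without replacement of size n_h in every stratum.

    Returns a boolean mask over the whole population (True = sampled).
    """
    N = sum(len(idx) for idx in strata)
    mask = np.zeros(N, dtype=bool)
    for idx in strata:
        mask[rng.choice(idx, size=n_h, replace=False)] = True
    return mask
```

Repeating `draw_stratified_sample` on a fixed population, once per replicate, gives the repeated-sampling setup over which the estimators are compared.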
is used for kernel smoothing on each of the populations. In each case, bandwidth values (see  ) (with), , and (see  ) are considered.
Data simulation, estimation and all computations were carried out in R on a desktop computer.
To analyze the performance of the proposed estimator against some specified estimators, relative absolute bias (RAB) is computed as
and the relative efficiency (RE) with respect to the Horvitz-Thompson (HT) estimator is computed as
is the estimator of the finite population total being considered; Y is the true population total and R is the number of replications.
The relative efficiency (RE) is meant to examine the robustness of the various estimators against the proposed estimator.
The confidence intervals (CI) and the average lengths (AL) of the confidence intervals of various estimators are also computed as follows:
where and are the upper and lower confidence limits respectively; and R are as defined earlier.
The results of this simulation study are summarized in Table 3 and Table 4. For each population, the performance of each estimator is analyzed using the RAB and RE. The RAB measures how close the estimator being considered is to the actual value, while the RE is used to check the robustness of the estimator. For instance, an estimator, , will be said to be “better” than, or preferable to, another one, , if its RE is comparably smaller. That is, if, where and are estimators, then is said to be “better” than.
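The two performance measures can be computed from the replicate estimates along the following lines. Because the defining formulas did not survive in this text, the definitions below are assumptions: RAB is taken as the average absolute deviation from the true total divided by the true total, and RE as the ratio of an estimator's empirical mean square error to that of the Horvitz-Thompson estimator, which matches the statement that a smaller RE is better. The function names are hypothetical.

```python
import numpy as np


def relative_absolute_bias(estimates, true_total):
    """Assumed definition: mean over replicates of |estimate - Y|, divided by |Y|."""
    est = np.asarray(estimates, dtype=float)
    return np.mean(np.abs(est - true_total)) / abs(true_total)


def relative_efficiency(estimates, ht_estimates, true_total):
    """Assumed definition: empirical MSE of the estimator over MSE of the HT estimator.

    Values below 1 mean the estimator has smaller MSE than Horvitz-Thompson.
    """
    est = np.asarray(estimates, dtype=float)
    ht = np.asarray(ht_estimates, dtype=float)
    mse = np.mean((est - true_total) ** 2)
    mse_ht = np.mean((ht - true_total) ** 2)
    return mse / mse_ht
```

Both functions take one estimate per replicate, so with R replications each input is an array of length R.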
Table 2. Summary of the formulae used in computing the respective population totals of the various estimators.
The confidence intervals and average length of the intervals are also measured for each case. A smaller length is better because it implies that the true population total is captured within a smaller range and therefore results are more precise.
In most scenarios, is better than the parametric estimators, but the parametric estimator, , performs best when the model is correctly specified, as Table 3 shows. This occurs both in the linear and the bump populations, where in the former, a strong linear relationship holds between the variables while in the latter, the function is linear over most of its range despite a “bump” for a small part of the range of.
When the model is completely misspecified, as in the Sine and Jump populations, greater efficiency can be achieved by the nonparametric regression estimators. This can be seen in Table 3 for the Sine and Jump populations: the nonparametric estimators (and) are more efficient than their parametric counterpart,.
When the underlying superpopulation model is completely unknown, a reasonable choice for finite population total estimation would be the nonparametric estimators such as and with small bandwidth choices. This can be seen in Table 3 and Table 4.
In this study, is sometimes seen to perform much better, and never worse, than, and hence the proposed estimator, emerges as the best performing among the nonparametric estimators considered here (see Table 3). The proposed estimator shows good overall performance, with smaller values of RAB and RE than the model-based competitor for every population and fixed bandwidth under consideration.
Despite being relatively the best estimator, its performance is significantly affected by the bandwidth choices. As the bandwidth size increases, some amount of efficiency is lost (see Table 3).
Table 3. Relative absolute bias (RAB) and relative efficiency (RE) based on 1000 replications of simple random sampling within strata from four fixed populations of size. Sample size is.
Table 4. Estimated lower and upper confidence limits and corresponding average lengths based on 1000 replications of simple random sampling within strata from four fixed populations of size. Sample size is. (LCL is the Lower Confidence Limit, UCL is the Upper Confidence Limit and AL is the Average Length).
Additionally, a closer look at the estimated totals in Table 3 shows that, as the bandwidth increases, the local linear regression estimator, becomes equivalent to the linear regression estimator,. This shows that the bandwidth has an effect on the mean square error of. In particular, for every bandwidth considered in this study, essentially dominates for all the populations except the Linear and Bump populations, where is competitive. Further, essentially dominates for all populations except the Jump population, where dominates all estimators being considered. The overall performance of is consistently good as long as the bandwidth remains small in this particular study.
In this study, the performance of the proposed estimator has been investigated against some design-based and model-based regression estimators. The RE values of the proposed estimator are in general close to one. It has been shown that, for every bandwidth considered, essentially dominates for all the populations except the Linear and Bump populations, where is competitive. Further, essentially dominates for all populations except the Jump population, where it dominates all estimators being considered. Generally, good confidence intervals are seen for the nonparametric regression estimators, and use of the proposed estimator leads to relatively smaller values of RE compared to other estimators. We conclude that the nonparametric regression approach under stratified random sampling using the proposed estimator yields good results.
Special thanks to the African Union (AU) for the funding that saw the success of this research.