The analysis of publicly available data from large medical and social surveys, such as the Demographic and Health Surveys (DHS), is becoming very common in a large number of studies. The samples are often obtained through complex designs involving stratification, clustering, multistage sampling, and unequal selection probabilities and response rates. With clustering, observations from the same cluster are correlated and, to obtain unbiased estimators, the sample weighting needs to be adjusted for this cluster effect. Ignoring the sampling method in data analysis can lead to inaccurate results. Some authors evaluated the adverse consequences of ignoring the sampling scheme in statistical analysis  . Therefore, to make valid inferences about the population from which the samples were drawn, appropriate statistical methods are required to analyze such complex survey data.
In medical and social sciences, where the interest is to predict a binary outcome from a set of covariates, the logistic regression model is commonly used. The logistic regression model is a member of the class of generalized linear models (GLM) and is appropriate for studying the relationship between a binary response variable $Y$, representing success or failure, and a set of covariates $x = (x_1, \dots, x_p)$. Assuming a Bernoulli distribution for the outcome variable $Y$, the model can be written as
$$\operatorname{logit}(\pi(x)) = \log\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p,$$
where $\beta_0, \beta_1, \dots, \beta_p$ are the unknown parameters to be estimated and $\pi(x) = P(Y = 1 \mid x)$ is the probability of success.
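As a small illustration of the logit link described above, the following Python sketch maps a linear predictor to a success probability and back. Python is used here only for illustration (the analysis in this paper relies on R), and the function names and coefficient values are hypothetical.

```python
import math


def success_probability(beta, x):
    """Return pi(x) = 1 / (1 + exp(-(beta_0 + beta_1*x_1 + ... + beta_p*x_p))).

    beta: [beta_0, beta_1, ..., beta_p]; x: [x_1, ..., x_p].
    """
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-eta))


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients: intercept -1.0 and slope 0.5 for a single covariate.
beta = [-1.0, 0.5]
pi_at_2 = success_probability(beta, [2.0])  # linear predictor is 0 here
pi_at_4 = success_probability(beta, [4.0])  # linear predictor is 1 here
print(pi_at_2, logit(pi_at_4))
```

Applying `logit` to the fitted probability recovers the linear predictor, which is why the coefficients are interpreted on the log-odds scale.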
The parameters of the model are estimated by the maximum likelihood method, assuming that the observations are independent and identically distributed. However, under complex sample designs involving stratification, clustering, multistage sampling and unequal selection probabilities, the assumption of independence between observations usually does not hold. Estimating the parameters by ordinary maximum likelihood can then lead to incorrect standard errors and, consequently, to problems in the associated hypothesis tests. Therefore, it is necessary to adjust the methods of standard logistic regression to account for the complex sampling design in order to make valid inferences  -  .
Several studies in health sciences have analyzed data from complex sampling designs using different types of software  -  ; however, none of them presents the specification and the estimation methods behind the logistic regression model. Most of these studies treat the method as a black box, presenting only the final results with no mention of the estimation method or the properties of the estimators. Our work fills this gap, giving researchers the opportunity to understand and replicate the logistic model with complex sampling in R.
In this context, this paper presents the framework of logistic regression models for complex sampling designs. This methodology is then applied to model the use of mosquito bed nets in Mozambique, i.e., to identify the factors that contribute to the use of a bed net as a way to reduce the risk of contracting malaria among women of reproductive age (15 - 49 years). To achieve this objective, we used the data from the 2011 Mozambique Demographic and Health Survey (MDHS 2011) concerning women aged 15 - 49 years. Since the MDHS 2011 sample is probabilistic, stratified and multistage, with unequal observation weights, i.e., a complex sampling design, the effect of the sampling design had to be taken into consideration in the descriptive and inferential analyses in order to obtain reliable results. Thus, it was necessary to select and use appropriate methods to compensate for the effect of the sample design in the analysis, as implemented in the survey package of the software R   .
The rest of the paper is organized as follows. Logistic regression for complex survey samples is described in Section 2. In Section 3, we describe the sampling method of the MDHS data and how it was taken into account in the application to the use of mosquito bed nets in Mozambique. Finally, Section 4 gives a brief conclusion.
2. Logistic Regression under Complex Survey Data
As noted in   , the standard logistic regression model is inappropriate when the data come from complex sampling designs.
Suppose that a finite population is divided into $H$ strata, indexed by $h = 1, \dots, H$; each stratum is further divided into primary sample units (PSU), indexed by $j$, each of which is constituted by secondary sample units (SSU), indexed by $i$, each comprising a number of elements. Assume also that the observed data consist of $n_{hj}$ SSU chosen from each of the $n_h$ PSU sampled in stratum $h$. The total number of observations is then given by $n = \sum_{h=1}^{H} \sum_{j=1}^{n_h} n_{hj}$. Each sampling unit has an associated sampling weight given by the inverse of its probability of inclusion in the sample, denoted here by $w_{hji}$ for the $hji$-th unit.
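Additionally, let $y_{hji}$ denote the binary response variable, $X$ the covariate matrix with rows $x_{hji}'$, and $\beta$ the vector of regression coefficients. In general, the survey logistic regression model is then given by
$$\operatorname{logit}(\pi_{hji}) = \log\left(\frac{\pi_{hji}}{1 - \pi_{hji}}\right) = x_{hji}'\beta, \qquad \pi_{hji} = P(y_{hji} = 1 \mid x_{hji}).$$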
Under the complex sampling design, the parameters of the logistic regression model are estimated by the maximum pseudo-likelihood method, also called weighted maximum likelihood, which incorporates the sampling design and the different sampling weights in the estimation of $\beta$     . The main idea of this method is to define a function that approximates the likelihood function of the finite population with a likelihood function formed from the observed sample and the known sampling weights     . In this case the pseudo-log-likelihood function is given by
$$\ell_p(\beta) = \sum_{h} \sum_{j} \sum_{i} w_{hji} \left[ y_{hji} \log \pi_{hji} + (1 - y_{hji}) \log(1 - \pi_{hji}) \right],$$
where $w_{hji}$ is the weight of observation $hji$, $y_{hji}$ is its binary response and $\pi_{hji}$ is its probability of success under the model. The maximum pseudo-likelihood estimator $\hat{\beta}$ is obtained by differentiating the pseudo-log-likelihood function with respect to $\beta$ and setting the result equal to zero,
$$\sum_{h} \sum_{j} \sum_{i} w_{hji} \, (y_{hji} - \pi_{hji}) \, x_{hji} = 0.$$
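The weighted score equations above have no closed-form solution in general and are usually solved iteratively. The following is a minimal Python sketch, assuming a model with an intercept and a single covariate and using Newton-Raphson steps; it is an illustration only and not the routine used by the R survey package. The function name and the toy data are hypothetical.

```python
import math


def fit_weighted_logistic(x, y, w, max_iter=50, tol=1e-10):
    """Maximize the pseudo-log-likelihood for logit(pi) = b0 + b1 * x.

    x, y, w: equal-length sequences of covariate values, 0/1 responses
    and sampling weights. Returns the estimates (b0, b1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        u0 = u1 = 0.0          # weighted score vector
        i00 = i01 = i11 = 0.0  # weighted information matrix
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = wi * (yi - p)
            u0 += r
            u1 += r * xi
            v = wi * p * (1.0 - p)
            i00 += v
            i01 += v * xi
            i11 += v * xi * xi
        # Newton-Raphson step: solve the 2 x 2 system (information) * step = score.
        det = i00 * i11 - i01 * i01
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Hypothetical toy data with a binary covariate and unequal weights.
x = [0.0, 0.0, 1.0, 1.0]
y = [1, 0, 1, 0]
w = [1.0, 3.0, 2.0, 2.0]
b0, b1 = fit_weighted_logistic(x, y, w)
print(b0, b1)
```

With a single binary covariate the model is saturated, so the fitted probabilities equal the weighted proportions of successes in each group (0.25 for x = 0 and 0.5 for x = 1 in this toy data), which gives a simple check on the iteration.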
Under complex sampling designs, there is no direct way to calculate the variance estimators. Thus, to obtain the variance of the maximum pseudo-likelihood estimators, methods such as Taylor linearization (also called the delta method), jackknife replication and the bootstrap are used    . In this paper, we use the Taylor linearization method, which is the method implemented in the R package survey  . This method results in the following variance estimator of $\hat{\beta}$:
$$\widehat{\operatorname{Var}}(\hat{\beta}) = (X' D X)^{-1} \, S \, (X' D X)^{-1},$$
where $X$ is the covariate matrix, $D$ is the diagonal matrix with elements $w_{hji} \hat{\pi}_{hji} (1 - \hat{\pi}_{hji})$, and $S$ is the pooled within-stratum estimator of the covariance matrix. This estimator is given by
$$S = \sum_{h=1}^{H} (1 - f_h) \frac{n_h}{n_h - 1} \sum_{j=1}^{n_h} (z_{hj} - \bar{z}_h)(z_{hj} - \bar{z}_h)',$$
where $z_{hj} = \sum_{i} w_{hji} \, x_{hji} \, (y_{hji} - \hat{\pi}_{hji})$ is the sum over all the sampled units in PSU $j$ in stratum $h$ and $\bar{z}_h = \frac{1}{n_h} \sum_{j=1}^{n_h} z_{hj}$ is the stratum-specific mean. The correction factor is given by $(1 - f_h)$, where $f_h = n_h / N_h$ is the ratio of the number of PSU observed, $n_h$, to the total number of PSU, $N_h$, in stratum $h$.
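The linearization estimator described above has a sandwich form: the inverse of the weighted information matrix on both sides of a between-PSU covariance of weighted score totals. The following pure-Python sketch implements that form under stated assumptions: each record carries a stratum label, a PSU label, a covariate vector that includes the intercept, a 0/1 response and a weight; every stratum has at least two PSU; and the finite population correction defaults to zero. It is an illustration, not the code of the R survey package, and all names and data are hypothetical.

```python
import math
from collections import defaultdict


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment with the identity matrix: [M | I].
    aug = [list(row) + [1.0 if r == c else 0.0 for c in range(n)]
           for r, row in enumerate(matrix)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [v / scale for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def linearization_variance(records, beta, fpc=None):
    """Sandwich variance of beta from (stratum, psu, x, y, w) records.

    fpc: optional {stratum: sampled PSU / total PSU}; missing strata use 0.
    """
    p = len(beta)
    fpc = fpc or {}
    xdx = [[0.0] * p for _ in range(p)]  # weighted information matrix
    totals = defaultdict(lambda: defaultdict(lambda: [0.0] * p))  # PSU score totals
    for h, j, x, y, w in records:
        pi = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, x))))
        for r in range(p):
            totals[h][j][r] += w * x[r] * (y - pi)
            for c in range(p):
                xdx[r][c] += w * pi * (1.0 - pi) * x[r] * x[c]
    s = [[0.0] * p for _ in range(p)]  # pooled within-stratum covariance
    for h, psus in totals.items():
        n_h = len(psus)  # assumes at least two PSU in the stratum
        mean = [sum(z[r] for z in psus.values()) / n_h for r in range(p)]
        scale = (1.0 - fpc.get(h, 0.0)) * n_h / (n_h - 1)
        for z in psus.values():
            dev = [z[r] - mean[r] for r in range(p)]
            for r in range(p):
                for c in range(p):
                    s[r][c] += scale * dev[r] * dev[c]
    bread = invert(xdx)
    return matmul(matmul(bread, s), bread)


# Hypothetical toy design: 2 strata, 2 PSU each, intercept plus a binary covariate.
records = [
    (1, 1, (1.0, 0.0), 1, 1.0), (1, 1, (1.0, 1.0), 1, 2.0),
    (1, 2, (1.0, 0.0), 0, 3.0), (1, 2, (1.0, 1.0), 0, 2.0),
    (2, 1, (1.0, 0.0), 0, 3.0), (2, 1, (1.0, 1.0), 1, 2.0),
    (2, 2, (1.0, 0.0), 1, 1.0), (2, 2, (1.0, 1.0), 0, 2.0),
]
# Pseudo-likelihood estimate for this toy data (weighted proportions 0.25 and 0.5).
beta_hat = [math.log(1.0 / 3.0), math.log(3.0)]
V = linearization_variance(records, beta_hat)
print([math.sqrt(V[k][k]) for k in range(len(beta_hat))])  # standard errors
```

Only the between-PSU variation within each stratum enters the middle term, which is why clustering typically inflates the standard errors relative to a fit that treats the observations as independent.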
The hypothesis tests for the significance of the regression coefficients and the test for the goodness of fit of the model also need to be modified to incorporate the sampling design and the different weights of the observations. The contribution of the covariates is now evaluated by the adjusted Wald test  , with test statistic given by    :
$$F = \frac{s - p + 1}{s \, p} \, W,$$
where $W$ is the Wald statistic computed from the estimated coefficients and their estimated covariance matrix, $s$ is the total number of selected PSU minus the number of strata and $p$ is the number of covariates. The $F$ statistic above follows an F-distribution with $p$ and $s - p + 1$ degrees of freedom, so that the test p-value is $P(F_{p,\, s - p + 1} \geq F)$.
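The adjustment can be sketched in a few lines of Python, assuming the standard form in which the Wald statistic is rescaled by (s - p + 1) / (s p), with s the number of sampled PSU minus the number of strata. The Wald value, the number of strata and the number of covariates below are hypothetical; only the 611 PSU come from the survey description in this paper.

```python
def adjusted_wald_f(wald, n_psu, n_strata, p):
    """Return (F, df1, df2) for the design-adjusted Wald test.

    wald: Wald chi-squared statistic for the p coefficients under test.
    n_psu, n_strata: number of sampled PSU and number of strata.
    """
    s = n_psu - n_strata  # design degrees of freedom
    df2 = s - p + 1
    if df2 <= 0:
        raise ValueError("too few PSU for the number of tested coefficients")
    f_stat = df2 / (s * p) * wald
    return f_stat, p, df2


# Hypothetical inputs: Wald statistic 30.0, 21 strata, 3 tested coefficients.
f_stat, df1, df2 = adjusted_wald_f(wald=30.0, n_psu=611, n_strata=21, p=3)
print(f_stat, df1, df2)
```

The p-value is then the upper tail probability of an F-distribution with `df1` and `df2` degrees of freedom, which requires a statistical library and is not computed here.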
Also, in order to obtain valid inferences using this type of design, we used Pearson's test statistic with the Rao-Scott adjustments. Alternatively, other test statistics that already incorporate the sampling plan can be used, such as the adjusted Wald statistic   . We also used the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) to compare the models   , and the likelihood ratio test  to measure the goodness of fit, taking into account the complex sampling design.
3.1. Burden of Malaria in Mozambique
The burden of malaria in Africa is still an important public health issue, particularly in the poorest tropical countries of the continent  . The adverse effects of this disease are related to a vicious cycle of poverty and disease, particularly in low income countries.
In Mozambique, malaria is the leading cause of death. It is recognized that environmental factors, particularly the levels of precipitation, temperature and humidity, create favorable conditions for mosquitoes throughout the country. Additionally, it has been very difficult, for economic reasons, to prevent and treat malaria in the whole population, in particular among women and children, the most vulnerable groups. Consequently, a national plan to control and eliminate malaria was developed and implemented in the country at the community level. One of the strategies adopted has been the distribution of mosquito bed nets. However, the indicators related to the use of bed nets by women and children in Mozambique are still clearly below the desired levels  .
In this context, it is crucial to understand the factors associated with mosquito bed net use in Mozambique.
3.2. Data, Sampling and Design Weights
The data used in this research come from the 2011 Mozambique Demographic and Health Survey (MDHS 2011), a national, population-based, cross-sectional survey. MDHS 2011 gathered information from 13,919 households, in which 13,745 women aged 15 to 49 years and 4035 men aged 15 to 64 years were interviewed. The data were collected between July and November 2011 through home interviews using three types of questionnaires (household, women and men).
The MDHS 2011 sample followed a complex sampling design (i.e., stratified two-stage cluster sampling with unequal selection probabilities, which result in sample weights) and was designed to obtain representative estimates at the national level, at the provincial level (11 geographic areas: Maputo province, Maputo city, Inhambane, Gaza, Sofala, Manica, Zambezia, Nampula, Tete, Niassa and Cabo Delgado), at the regional level (north, center and south) and by area of residence (urban and rural), for women aged 15 - 49 years and men aged 15 - 64 years. The strata considered in the sample were defined by province and area of residence. The first stage of sample selection consisted of obtaining 611 primary sample units (PSU), which are the enumeration areas of the 2007 population and housing census, selected with probability proportional to the number of households in each stratum within the provinces.
The probability of selection of PSU $j$ in stratum $h$ is given by
$$P_{1hj} = \frac{a_h M_{hj}}{M_h},$$
where $a_h$ is the number of PSU selected in stratum $h$, $M_{hj}$ is the number of households within PSU $j$ and $M_h$ is the total number of households in stratum $h$. In the second stage, the secondary sample units (SSU), which are the households, were sampled. In each PSU, $m_{hj}$ households were selected (20 in urban PSU and 25 in rural PSU) out of the total of $M_{hj}$ households in PSU $j$ of stratum $h$, so that the conditional probability of selecting household $i$ in PSU $j$ in stratum $h$ is given by
$$P_{2hji} = \frac{m_{hj}}{M_{hj}}.$$
Thus, the probability of selecting household $i$ in PSU $j$ in stratum $h$ is
$$P_{hji} = P_{1hj} \times P_{2hji} = \frac{a_h m_{hj}}{M_h}.$$
Finally, data were collected from all women aged 15 to 49 years and all men aged 15 to 64 years who were in the selected households. Because the allocation of the sample is not proportional, weights are used to compensate for the unequal sampling probabilities; in this way, results from the sample can be inferred to the population. The sample weights are the inverse of the overall selection probability, with some corrections for non-response. For further details see  .
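The two-stage selection probabilities and the resulting design weight can be illustrated with a short Python sketch. It assumes PSU are drawn with probability proportional to their number of households and that a fixed number of households is then drawn within each PSU; the non-response corrections mentioned above are not included. The function names and all numerical inputs except the 25 households per rural PSU are hypothetical.

```python
def first_stage_probability(psu_selected, households_in_psu, households_in_stratum):
    """Probability of selecting a PSU with probability proportional to size."""
    return psu_selected * households_in_psu / households_in_stratum


def second_stage_probability(households_selected, households_in_psu):
    """Conditional probability of selecting a household within a sampled PSU."""
    return households_selected / households_in_psu


def design_weight(psu_selected, households_in_psu, households_in_stratum,
                  households_selected):
    """Inverse of the overall household selection probability (no non-response correction)."""
    p1 = first_stage_probability(psu_selected, households_in_psu, households_in_stratum)
    p2 = second_stage_probability(households_selected, households_in_psu)
    return 1.0 / (p1 * p2)


# Hypothetical rural stratum: 30 PSU selected, 45,000 households in the stratum,
# 150 households in this PSU, 25 households selected in the PSU.
weight = design_weight(psu_selected=30, households_in_psu=150,
                       households_in_stratum=45000, households_selected=25)
print(weight)
```

Because the PSU size cancels between the two stages, every household in a stratum with the same number of households selected per PSU receives the same base weight under this scheme.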
3.3. Data Analysis
Survey logistic regression was applied to identify the factors that influence the use of a bed net. The outcome variable is the use of a bed net for sleeping on the previous night (1 = use, 0 = no use).
For the purposes of this study, we used the data on women of reproductive age, extracted from the women's survey, and merged that information with the household data in order to obtain some relevant variables that were not included in the women's database, such as owner of bed net, type of mosquito bed net in the household, number of bed nets in the household and household dwelling sprayed in the last year. The two databases were linked according to the methodology proposed by  .
The independent variables included in the model are: grouped age, marital status, province, place of residence, education level, wealth index, currently pregnant, currently working, number of household members, sex of the household head, source of drinking water, owner of bed net, type of mosquito bed net in the household, number of bed nets in the household, household dwelling sprayed in the last year.
A multiple logistic regression model including the independent variables above was fitted to the data. In the analysis we used the Rao-Scott tests and calculated the unadjusted odds ratios (UOR) to test for possible associations between the independent variables considered and the outcome. The final model retained only the variables that were associated with the use of mosquito nets at the 5% significance level, as assessed by the adjusted Wald test. Afterwards, the adjusted OR were calculated. Because of the complex sampling nature of the data, all analyses were conducted using the R package survey  , in which all the design features, such as stratification, clustering and weighting, were accounted for explicitly by using the svydesign function. To specify the model, i.e., the predictors and their functional form together with the link function, we used the function svyglm. The goodness of fit of the model was assessed as explained in  .
3.4.1. Summary Statistics
The sample includes 13,745 women of reproductive age (15 - 49 years) in Mozambique in 2011. The median age is 28.6 years, and most of the women lived in rural areas (65.3%) and are married or living with a partner (67.8%). The educational level is very low: 31.2% are illiterate, 50.2% have primary education and only 1.3% attained higher education. About 11% of the women reported being pregnant, 61% are not working, and the majority of the women reported that the head of the household was a man (64.7%). Most of the households (54.4%) have access to an improved water source.
About 61% own at least one mosquito bed net; however, only 38.4% of the women used a mosquito bed net to sleep on the previous night, and 22.6% of the women reported that the household dwelling had been sprayed in the last 12 months.
Based on the Rao-Scott independence test, we found a statistically significant association (at the 1% level) between the outcome variable (use of mosquito bed net) and the covariates grouped age, province, place of residence, education level, currently working and source of drinking water. For the other variables there is not sufficient evidence that they are associated with the outcome variable.
3.4.2. Logistic Regression Estimation
In the first step we included all the covariates listed in Section 3.3. Then we retained only the variables with coefficients statistically significant at the 5% level. This strategy was confirmed using likelihood ratio tests for complex sampling. The final model includes: grouped age, marital status, province, place of residence, education level, wealth index, currently pregnant, number of household members and number of mosquito bed nets in the household.
Table 1 shows the results of the logistic regression with complex sampling. As can be seen, adjusting for other variables, older women have higher odds of using a mosquito bed net; for example, the odds of using a mosquito bed net among women aged 20 - 24 years are 1.63 times those of women aged 15 - 19 years (OR = 1.63).
Women in the provinces of the south of the country (Inhambane, Gaza, Maputo Province and Maputo City) were less likely to use a bed net than women in Niassa province. Women living in rural areas have lower odds (OR = 0.70) of using a mosquito bed net than women in urban areas.
Women who are married, living with a partner, widowed, divorced or separated are more likely to use a mosquito bed net than single women (OR = 2.28, 2.31, 1.37, 1.37 and 1.36, respectively).
More educated women are more likely to use a mosquito bed net than women with no education (OR > 1), adjusting for other variables.
Table 1. Logistic regression estimation results with complex sampling.
Pregnant women are less likely to use a mosquito bed net (OR = 0.73). Table 1 also shows that each additional person in the household is associated with a 25% reduction (OR = 0.75) in a woman's odds of using a mosquito bed net. Finally, each additional bed net in the household multiplies the odds of a woman using a bed net by 4.22 (OR = 4.22), adjusting for other variables.
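Since odds ratios act on the odds and not directly on the probability, a small Python sketch may help in reading the estimates reported above. It converts an odds ratio into a percentage change in the odds and shows how a baseline probability changes when its odds are multiplied by an odds ratio. The function names and the 30% baseline probability are hypothetical; the odds ratios are those quoted in the text.

```python
def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0


def apply_odds_ratio(baseline_probability, odds_ratio):
    """Probability obtained after multiplying the baseline odds by an odds ratio."""
    odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


print(percent_change_in_odds(0.75))  # one more household member
print(percent_change_in_odds(4.22))  # one more bed net in the household

# Hypothetical baseline: if 30% of women aged 15 - 19 used a bed net, an
# OR of 1.63 for women aged 20 - 24 would not mean 1.63 times that probability.
p_older = apply_odds_ratio(0.30, 1.63)
print(p_older, 1.63 * 0.30)
```

The gap between the two printed values in the last line is why the estimates are described in terms of odds rather than probabilities.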
4. Conclusion
Complex sampling designs are widely used in population-based surveys such as the DHS. However, the complexity behind this methodology, which involves stratification, clustering and multistage sampling, is still not well understood by applied health scientists. In this paper we fill this gap by specifying the logistic regression model and its estimation within the context of complex sampling, using the R software and an example related to bed net use in Mozambique. We show that it is possible to obtain reliable results and more efficient estimators by using appropriate methods to correct for the effect of the sample design. Moreover, this study, together with the availability of open source software (R), should encourage scientists to make more frequent use of the large number of publicly available survey databases, particularly in low income countries.
We thank the Editor and the referee for their comments. This work was partially supported by the Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) through the project UID/MAT/00297/2013 (Centro de Matemática e Aplicações). The research of S. Rodrigues Cassy is funded by the Calouste Gulbenkian Foundation, grant process 135422. This support is greatly appreciated. We also extend our thanks to DHS Measure for allowing us to use the MDHS 2011 dataset for this study.