Reliable population-level data on human immunodeficiency virus (HIV) and its associated determinants are crucial for understanding the dynamics of the epidemic. Countries with generalized HIV epidemics obtain estimates from surveillance systems such as antenatal care surveillance surveys, which have their limitations. Population-based second generation surveillance surveys have been used repeatedly by a number of countries to monitor the HIV epidemic and are now considered a gold standard. These surveys address some of the weaknesses encountered in antenatal care surveillance surveys and thus greatly enhance surveillance systems. Second generation surveillance surveys combine behavioural, socio-economic and biomedical data, which together provide greater explanatory power for assessing the HIV epidemic in a country.
In South Africa, several small-scale focused HIV surveys have been conducted. In the past 10 years the Human Sciences Research Council (HSRC) has conducted a series of population-based second generation surveillance surveys. The sampling design of such surveys varies from country to country, with varying challenges.
The design of the South African second generation surveys has varied from one wave of the survey to another. In 2002 and 2005, one person was randomly selected in each of three age groups (2 to 14 years, 15 to 24 years, and 25 years and above) in each sampled household. In 2008, persons younger than two years were also included in the survey. This resulted in a sample of at most four people in each household, drawn from four distinct age groups: younger than 2 years, 2 to 14 years, 15 to 24 years, and 25 years and above. In 2012, all household members were eligible to participate in the survey.
In analyses of surveys where inference is drawn on trends over time, it is important to ascertain whether differences in the results are true differences or an artifact of variations in methodological design. For example, the inclusion or exclusion of high-risk groups in some settings can lead to biased estimates. The objective of this paper is to compare the HIV prevalence estimates obtained when all persons in the sampled households were invited to participate with those obtained when one person is randomly sampled in each age group (younger than two years, 2 to 14 years, 15 to 24 years, and 25 years and above), as was done in the previous survey of 2008. The comparison uses the population-based data of the South African 2012 national HIV prevalence survey reported in Shisana, Rehle, Simbayi, et al. 2014.
2.1. Design of Sampling Frame
Complex survey designs involve a combination of design components, including stratification, multistage sampling and selection with unequal probabilities or weighting. The design of the South African national HIV household surveys is complex and based on a multi-stage stratified cluster sample design. A random sample of 1000 census enumeration areas (EAs), drawn from a national database of 86,000 EAs used during the 2001 census, served as the Master Sample of primary sampling units. The Master Sample was explicitly stratified by province and locality type of the EAs. Locality types were urban formal, urban informal, rural formal (including commercial farms) and rural informal (tribal authority areas). In the formal urban areas, race was used as an additional stratification variable. In each sampled EA, a cluster of 15 households was randomly sampled to form the secondary sampling unit. In each sampled household, all persons residing at the household, including visitors who had spent the previous night there, were invited to participate; this is referred to as the “take all” approach.
Figure 1 presents the two sampling approaches used in the comparative analysis.
The “take all” approach implies that the designs of the previous HSRC surveys can be reconstructed from the database. Using the captured data, individuals within each household were grouped by age group as presented in Figure 1. Within each age group in each household, one person was randomly selected for further analysis (“sub-sampling”) using PROC SURVEYSELECT in Statistical Analysis System (SAS) version 9.3 (SAS Institute). The sub-sampled data (three-stage sampling of EAs, dwellings and sub-sampled persons) mimic the previous HSRC survey designs for all practical purposes. If the person selected in this way had refused to participate, the result was recorded as a refusal in the sub-sampled data.
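The sub-sampling step can be illustrated with a minimal Python sketch. It is not the SAS procedure used in the study; the record keys (`household`, `person`, `age`) and the helper names are hypothetical, and the age group boundaries simply follow the four groups described above.

```python
import random
from collections import defaultdict

# Age groups from the text: lower bound inclusive, upper bound exclusive (years).
AGE_GROUPS = [("<2", 0, 2), ("2-14", 2, 15), ("15-24", 15, 25), ("25+", 25, 200)]


def age_group(age):
    """Return the label of the age group that contains `age`."""
    for label, low, high in AGE_GROUPS:
        if low <= age < high:
            return label
    raise ValueError(f"age out of range: {age}")


def sub_sample(roster, seed=None):
    """Select one person at random in each (household, age group) cell.

    `roster` is a list of dicts with hypothetical keys 'household',
    'person' and 'age'. Each selected record also carries 'n_eligible',
    the number of roster members in its cell, which is the inverse of
    the within-cell selection probability.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for rec in roster:
        cells[(rec["household"], age_group(rec["age"]))].append(rec)

    selected = []
    for (_, label), members in sorted(cells.items(), key=lambda kv: kv[0]):
        chosen = dict(rng.choice(members))  # copy, so the roster is not mutated
        chosen["age_group"] = label
        chosen["n_eligible"] = len(members)
        selected.append(chosen)
    return selected
```

Because selection is applied to the complete roster, a person who refused can still be drawn; as in the text, such a person would remain in the sub-sample as a refusal.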
Figure 1. Schematic representation of the sample design.
A consequence of implementing a complex survey design is that sampling errors of the survey estimates cannot be computed using the standard formulae found in statistical texts, since those formulae assume independently and identically distributed random variables. Variances are instead estimated using methods developed for complex sample designs, and the resulting variances are often larger than those obtained from standard formulae. A design effect, defined as the ratio of the complex variance estimate to the variance obtained from standard formulae, is computed to shed light on the precision of survey estimates under the two designs. In this design, a household is considered a cluster, and a synthetic measure of homogeneity within clusters (ρ) is computed to measure the level of homogeneity within households for each determinant.
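As a minimal sketch of these two quantities, the functions below compute a design effect as a variance ratio and back out a synthetic ρ from it. The text does not give the formula used for ρ, so the Kish relation deff = 1 + (m − 1)ρ, with m the average number of sampled persons per household, is an assumption here, and the function names are illustrative.

```python
def srs_variance_proportion(p, n):
    """Variance of a proportion under simple random sampling."""
    return p * (1 - p) / n


def design_effect(var_complex, var_srs):
    """Ratio of the complex-design variance to the simple random sampling variance."""
    return var_complex / var_srs


def synthetic_rho(deff, mean_cluster_size):
    """Back out rho from deff = 1 + (m - 1) * rho (assumed Kish relation)."""
    if mean_cluster_size <= 1:
        raise ValueError("rho is undefined when clusters hold a single person")
    return (deff - 1) / (mean_cluster_size - 1)
```

For example, `synthetic_rho(1.3, 4)` is approximately 0.1: a design effect of 1.3 with four sampled persons per household corresponds to ρ of about 0.10 under this relation. These input values are illustrative, not taken from Table 2.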
2.2. Weighting and Benchmarking of the Sample
Owing to the multi-stage stratified sampling design of the survey, some individuals have a greater or lesser probability of being selected than others. To correct for potential bias due to unequal sampling probabilities, sample weights were introduced at the EA, household and individual levels, and these weights were also adjusted for non-response. In the take all approach, the final sampling weight was equal to the final EA weight multiplied by the final visiting point (VP) sampling weight, adjusted for individual non-response. In the sub-sampling approach, the final sampling weight was equal to the final EA weight multiplied by the final VP sampling weight and by the sampling weight of each person within their age group in the household, adjusted for individual non-response. The sampling weights thus corrected for the unequal number of household members within each age group.
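The composition of the final weight under the two approaches can be sketched as follows. This is an illustration under stated assumptions: the within-household person weight is taken to be the number of eligible persons in the age group (the inverse of a one-in-n selection), and the non-response adjustment is taken to be division by a response rate within some adjustment class. Neither detail is spelled out in the text, and the function name is hypothetical.

```python
def final_weight(ea_weight, vp_weight, n_eligible_in_age_group=1, response_rate=1.0):
    """Final individual weight before benchmarking.

    Take all approach: leave n_eligible_in_age_group at 1, because every
    household member is included with certainty.
    Sub-sampling approach: pass the number of household members in the
    selected person's age group (assumed inverse selection probability).
    response_rate is the assumed individual response rate of the person's
    adjustment class.
    """
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    return ea_weight * vp_weight * n_eligible_in_age_group / response_rate
```

With illustrative inputs, `final_weight(10, 5)` gives 50 for a take all respondent, whereas `final_weight(10, 5, 3, 0.75)` gives 200.0 for a person sub-sampled from three eligible household members in a class with a 75% response rate.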
The final individual weights were benchmarked to 2012 mid-year population estimates by age, race, sex, and province . This process produced a final sample representative of the population in South Africa for sex, age, race, and province.
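One simple way to benchmark weights is post-stratification, sketched below. The text does not say whether cell-based post-stratification or an iterative method such as raking was used, so this is only an assumed illustration; the `cell` key stands for a hypothetical combination of age group, race, sex and province.

```python
from collections import defaultdict


def benchmark(records, population_totals):
    """Rescale weights so that they sum to the population total in each cell.

    `records` is a list of dicts with hypothetical keys 'cell' and 'weight';
    `population_totals` maps each cell to its mid-year population estimate.
    Returns new records and leaves the input unchanged.
    """
    weight_sums = defaultdict(float)
    for rec in records:
        weight_sums[rec["cell"]] += rec["weight"]

    benchmarked = []
    for rec in records:
        factor = population_totals[rec["cell"]] / weight_sums[rec["cell"]]
        benchmarked.append({**rec, "weight": rec["weight"] * factor})
    return benchmarked
```

After this step the weights in every cell add up to the corresponding population estimate, which is what makes the weighted sample representative with respect to the benchmarking variables.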
2.3. Data Capturing, Management and Analysis
Survey data from questionnaires were double entered and verified by the Data Capturing Unit (DCU) at the HSRC using Census and Survey Processing System (CSPro) software version 5.0 (U.S. Census Bureau). A database was designed with range restrictions to ensure that captured data were not out of range. Exploratory analysis was conducted in SPSS version 17. Final data analysis was conducted in Stata version 12 (Stata Corporation, College Station, TX), taking into account the complex design aspects of the survey via the suite of svy commands. The svy commands were used to obtain estimates of HIV prevalence, proportions of responses or absolute estimated totals, together with 95% confidence intervals (95% CI).
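To show what a design-based estimate involves, the following Python sketch computes a weighted prevalence and a Taylor-linearised standard error that treats EAs as primary sampling units within strata. It is not the Stata code used in the study: the record keys are hypothetical, finite population corrections are ignored, and the normal-approximation interval is a simplification of the intervals that survey software reports.

```python
import math
from collections import defaultdict


def weighted_prevalence(records):
    """Weighted proportion and linearised standard error.

    `records` is a list of dicts with hypothetical keys 'stratum', 'psu',
    'weight' and 'y' (1 = HIV positive, 0 = HIV negative).
    """
    total_w = sum(r["weight"] for r in records)
    p = sum(r["weight"] * r["y"] for r in records) / total_w

    # Sum the linearised scores w * (y - p) / sum(w) within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for r in records:
        psu_totals[r["stratum"]][r["psu"]] += r["weight"] * (r["y"] - p) / total_w

    # Between-PSU variance within strata (with-replacement approximation).
    variance = 0.0
    for psus in psu_totals.values():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        mean_h = sum(psus.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - mean_h) ** 2 for z in psus.values())
    return p, math.sqrt(variance)


def confidence_interval(p, se, z=1.96):
    """Normal-approximation interval (an assumed simplification)."""
    return p - z * se, p + z * se
```

Dividing the squared standard error from this function by the simple random sampling variance p(1 − p)/n gives the design effect for the same estimate.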
In total, 15,000 households were sampled from 1000 EAs. Households that were found to have been destroyed were not considered as non-response. Of the sampled households, 13,083 (87.2%) were valid and occupied, whilst 1266 were invalid. Of the valid households, 11,079 were interviewed, giving a household response rate of 84.7% for the 2012 survey. Non-response at household level comprised 1347 households (10.3%) that refused to take part in the survey and 657 (5.0%) that were valid but empty after the required visits. Households in urban formal areas had the lowest response rate (80.3%), whereas households in other geographical areas had response rates above 91%. All provinces had a visiting point response rate of 80% and above, except Gauteng (78.4%) and Western Cape (78.9%). Table 1 presents individual response rates for both specimen and questionnaire.
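The household-level percentages above follow directly from the reported counts, as this short check shows. The variable names are illustrative; the counts are those given in the text.

```python
# Household counts reported for the 2012 survey.
sampled = 15_000
valid = 13_083
interviewed = 11_079
refused = 1_347
empty_after_visits = 657


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


valid_pct = pct(valid, sampled)                # share of sampled households that were valid
response_rate = pct(interviewed, valid)        # household response rate
refusal_pct = pct(refused, valid)              # refusals among valid households
empty_pct = pct(empty_after_visits, valid)     # valid but empty after required visits
```

The interviewed, refused and empty households together account for all 13,083 valid households, so the response rate and the two non-response percentages share the same denominator.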
The results in Table 1 show that there are no considerable differences in the questionnaire and HIV testing response rates between the take all and sub-sampling approaches. The differences in all other determinants such as age, race, sex and geotype are less than two percent.
In the take all approach, the household size ranged from 1 to 18 people, whilst in the sub-sample it ranged from 1 to 4 people, since at most four people could be sampled from each household. The crude percentage of people infected with HIV per household size was practically the same under the take all and sub-sampling approaches (Figure 2). The percentages of households with at least one person infected with HIV, however, differ consistently and systematically between the two approaches. As the household size increases, the likelihood that at least one person is infected with HIV increases consistently, and this is more pronounced in the take all approach than in the sub-sampling approach. This indicates an increased likelihood of similar HIV positive status among individuals from the same household in the take all approach compared with the sub-sampling approach.
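The two quantities compared per household size can be computed as in the sketch below, which assumes unweighted (crude) percentages and takes household size to be the number of persons with an HIV result in the data set being summarised. The record keys and the function name are hypothetical.

```python
from collections import defaultdict


def summarise_by_household_size(records):
    """Crude HIV summaries per household size.

    `records` is a list of dicts with hypothetical keys 'household' and
    'hiv_positive' (bool), one record per tested person.
    """
    households = defaultdict(list)
    for r in records:
        households[r["household"]].append(bool(r["hiv_positive"]))

    counts = defaultdict(lambda: {"people": 0, "positives": 0, "households": 0, "affected": 0})
    for statuses in households.values():
        c = counts[len(statuses)]
        c["people"] += len(statuses)
        c["positives"] += sum(statuses)
        c["households"] += 1
        c["affected"] += any(statuses)  # household has at least one positive person

    return {
        size: {
            "pct_positive": 100 * c["positives"] / c["people"],
            "pct_households_affected": 100 * c["affected"] / c["households"],
        }
        for size, c in sorted(counts.items())
    }
```

Running the function once on the take all data and once on the sub-sampled data gives the two pairs of series that a comparison of this kind requires.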
Table 2 presents key comparison results between the two sampling approaches when assessing the validity of the HIV results. The HIV estimates from the two methods are very comparable, with no consistent pattern in any direction. These results agree with the consistently similar proportions of HIV positives between the two methods in Figure 2. However, the estimates based on the sub-sampling approach are more variable than those from the take all approach. The design effects in the take all approach are also slightly higher than those obtained under sub-sampling. The design effects vary proportionally with the synthetic measure of homogeneity (ρ), indicating a higher intraclass correlation within households with respect to HIV in the take all approach than in the sub-sampling approach. The overall synthetic measure of homogeneity for both methods is ρ = 0.10.
Figure 2. Percentages of HIV positives and households with at least one person infected with HIV per household size.
Table 1. Response rate for take all and sub-sampled approaches.
Table 2. HIV prevalence among participants age 0 years and older, socio-demographic characteristics, coefficient of variation, and the design effect.
The overall HIV prevalence estimate is slightly higher in the take all approach (12.2%) than in the sub-sampling approach (11.6%). All other estimates for age, sex, race, geotype and province are in the same direction and of the same order of magnitude (Table 2).
This paper compares the results from the take all approach used in the 2012 survey with those from the sub-sampling design implemented in the previous surveys. The calculated response rate was similar for both methods. The findings show that the estimates of HIV are comparable for all key determinants. However, the estimates from sub-sampling are more variable than those from the take all approach. This could be a function of cluster (household) sample size and the intraclass correlation within each cluster, since high correlation reduces the effective sample size in practice. In generalised epidemic settings like South Africa, the risk of HIV infection is likely to be clustered within households owing to both heterosexual transmission among sexual partners within households and vertical transmission to their children. The overall estimates of intraclass correlation and design effect are similar for both methods. However, for various determinants, the estimates of intraclass correlation and design effects are moderately higher for the take all approach than for sub-sampling.
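The interplay between household sample size, intraclass correlation and effective sample size can be made concrete with a short sketch. It assumes the Kish approximation deff = 1 + (m − 1)ρ, which the text does not state explicitly, and the household sizes and sample sizes in the example are illustrative only.

```python
def kish_design_effect(mean_cluster_size, rho):
    """Approximate design effect due to clustering: 1 + (m - 1) * rho."""
    return 1 + (mean_cluster_size - 1) * rho


def effective_sample_size(n, deff):
    """Number of independent observations carrying the same information."""
    return n / deff


# Illustration with rho = 0.10 and two assumed average household sample sizes.
for m in (2, 4):
    deff = kish_design_effect(m, 0.10)
    n_eff = effective_sample_size(1000, deff)
    print(f"m={m}: deff={deff:.2f}, effective n per 1000 persons={n_eff:.0f}")
```

Under these assumptions each additional person sampled from the same household contributes less than one independent observation, yet the total effective sample size still grows as more household members are included, which is consistent with the lower variability observed for the take all approach.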
The comparison of the two methods is subject to some limitations. The sub-sampling arm of the study is conditional on the household roster obtained under the take all alternative. In an actual implementation of the sub-sampling method, there could be non-coverage and differential non-response. Thus the simulated experiment might not replicate exactly the outcome that would be obtained under two real survey conditions.
In conclusion, the two approaches yield similar results for all practical purposes. However, even though its higher intraclass correlation results in a smaller effective sample size, the take all approach is preferable to the sub-sampling approach. The take all approach allows for further analyses of the data, such as estimating discordance between sexual partners and within parent-child pairs.
The data used in this article come from a study supported by the President’s Emergency Plan for AIDS Relief (PEPFAR) through the Centers for Disease Control and Prevention (CDC) under the terms of 3U2GGH000570 and by the South African National AIDS Council (SANAC). Its contents are solely the responsibility of the authors and do not necessarily represent the official views of CDC or SANAC.
Stover, J., Ghys, P.D. and Walker, N. (2004) Testing the Accuracy of Demographic Estimates in Countries with Generalized Epidemics. AIDS, 18, S67-S73.
Garcia-Calleja, J.M., Gouws, E. and Ghys, P.D. (2006) National Population Based HIV Prevalence Surveys in Sub-Saharan Africa: Results and Implications for HIV and AIDS Estimates. Sexually Transmitted Infections, 82, iii64-iii70.
Boerma, T.J., Ghys, P.D. and Walker, N. (2003) Estimates of HIV-1 Prevalence from National Population-Based Surveys as a New Gold Standard. The Lancet, 362, 1929-1931.
Mishra, V., Vaessen, M., Boerma, T.J., Arnold, F., Way, A., Barrere, B., et al. (2006) HIV Testing in National Population-Based Surveys: Experience from the Demographic and Health Surveys. Bulletin of the World Health Organization, 84, 537-545.
Auvert, B., Bollard, R., Campbell, C., et al. (2001) HIV Infection among Youth in a South African Mining Town Is Associated with Herpes Simplex Virus-2 Seropositivity and Sexual Behaviour. AIDS, 15, 5885-5989.
Colvin, M., Abdool Karim, S.S., Connolly, C., Hoosen, A.A. and Ntuli, N. (1998) HIV Infection and Asymptomatic Sexually Transmitted Infections in a Rural South African Community. International Journal of STD & AIDS, 9, 548-550.
MacPhail, C., Williams, B. and Campbell, C. (2002) Relative Risk of HIV Infection among Young Men and Women in a South African Township. International Journal of STD & AIDS, 13, 331-342.
Shisana, O. and Simbayi, L. (2002) Nelson Mandela/HSRC Survey of HIV/AIDS: South African National HIV Prevalence. Behavioural Risks and Mass Media Household Survey 2002. Human Sciences Research Council (HSRC) Press, Cape Town.
Shisana, O., Rehle, T., Simbayi, L.C., Zuma, K., Jooste, S., et al. (2009) South African National HIV Prevalence. Incidence and Communication Survey 2008: A Turning Tide among Teenagers? HSRC Press, Cape Town.
Vaessen, M., Thiam, M. and Thanh, L. (2004) The Demographic and Health Surveys. In: Household Sample Surveys in Developing Countries: Design, Implementation and Analysis, United Nations, Department of Economic and Social Affairs, New York, 495-522.
Rust, K.F. and Rao, J.N.K. (1996) Variance Estimation for Complex Surveys Using Replication Techniques. Statistical Methods in Medical Research, 5, 283-310.
Zuma, K., Lurie, M. and Jorgensen, M. (2006) Analysis of Interval-Censored Data from Circular Migrant and Non-Migrant Sexual Partnerships Using the EM Algorithm. Statistics in Medicine, 26, 309-319.