Dengue Fever is fast emerging pandemic-prone viral disease in many parts of the world. Dengue flourishes in urban poor areas, suburbs and the countryside but also affects more affluent neighbor hoods in tropical and subtropical countries. Dengue is a mosquito-borne viral infection causing a severe flu-like illness and, sometimes causing a potentially lethal complication called severe dengue. The incidence of dengue has increased 30-fold over the last 50 years. Up to 50 - 100 million infections are now estimated to occur annually in over 100 endemic countries, putting almost half of the world’s population at risk  .
Tropical countries are the most heavily affected due to environmental, climatic, and social conditions. Studies of climatic variables can improve knowledge and prediction of epidemic seasonality. The climate is an important factor in the temporal and spatial distribution of vector-transmitted diseases as dengue fever  .
Many works sought to identify climatic influences on dengue, and to evaluate the ability of the climate-based dengue models to describe associations between climate and dengue, simulate outbreaks by generalized additive models―GAMs  . This model provides a flexible method for identifying nonlinear covariate effects in exponential family models and other likelihood-based regression models. For this, it used a degree of freedom estimate to assess the importance of covariates based on the expected decrease in the deviance due to smoothing, computable from the trace of the appropriate smoother matrix  .
We assessed the potential contribution of climatic variables on Dengue Fever (DF) incidences based in GAM, according Hastie and Tibshirani (1990)  , and we provided suggestions to improve their performance generated from the statistical analyses of the direct and indirect associations.
Mordecai et al. (2017)  used generalized models associated with R package visreg to understand the impact of temperature on transmission of Zika, dengue and chikungunya. Specifically, Oliveira (2016)  used visreg associated with GAM to understand the effect of temperature on ovulation by Aedes aegypti in Rio de Janeiro. It wasn’t found in the literature, to date, a study involving GAM and visreg package associated with relative humidity.
Ferreira et al. (2017)  used (Logistic Regression and) GAM associated to Binomial Negative distribution and model offset to understand DF cases relationship meteorological variables, specifically, temperature, rainfall and humidity.
This work aims to identify the risk of DF incidence by the occurrence limits parametrization of climatic variables as a function of the time (months and years), in capitals of the NEB, from January 2001 to December 2012, as from visualizing the fit of regression models arising from of GAM, assuming Poisson Distribution, by cross-sectional plots using two-dimensional contour, by “visreg” package function.
To understand the risk of DF incidence by the occurrence limits parametrization of climatic variables on capitals of the NEB, we conducted the GAM analysis by average monthly data observed from 9 capitals of Brazilian Northeast (NEB), in the period of January 2001 to December 2012. These capitals and their respective codes of the Federative Unit (referring to their States): Aracaju-SE, Fortaleza-CE, João Pessoa-PB, Maceió-AL, Natal-RN, Recife-PE, Salvador-BA, São Luís-MA and Teresina-PI, according Figure 1. The data were provided by:
1) Climatological variables: rainfall, in mm, (PRP); minimum, average and maximum temperature, in ˚C, (respectively, T-min, T-mean and T-max); relative humidity, in % (RH); all collected by Meteorological Databank for Education and Research―BDMEP1, from the National Institute of Meteorology―INMET. From these variables collected, we calculated:
a) Vapour pressure deficit (VPD) and saturated vapour pressure deficit (SVPD), all according Allen et al. (1998)  ;
b) Evapotranspiration of Reference (ETO), according Thornthwaite (1948)  ;
c) Annual and monthly heat index, respectively, HI-a and HI-m; as well as, its Function (HI-f), both according to Steadman (1979)  ; and
d) Human comfort index (HCI), according to Rosenberg (1983)  .
2) DF cases collected by the site SINAN-Net2, from the Departamento de Informática do Sistema Único de Saúde―DATASUS; and transformed into DF Incidence Rate; and
3) Annual population size for each of the nine capitals studied, collected by the Sistema de Recuperação Automática―SIDRA3, from the Brazilian Institute of Geography and Statistics―IBGE.
The monthly reporting DF cases were converted to DF incidence rates, which, according to the Ministry of Health  , this is defined as the number of confirmed cases (classic and hemorrhagic DF), by 100,000 people in certain geographic space and the current year, and calculated according to Equation (1):
DF incidence are classified by occurrence bands, as criteria of the National Program for Dengue Control―PNCD/MS  , which considers: 1) low incidence = 0| … 100; 2) average incidence = 100| … |300; and 3) high incidence = 300 … ∞.
All analyses were conducted using the R-Project Software, Version 3.0.34.
2.1. Generalized Additive Model
The class of models known as generalized linear models, or GLMs, was formally introduced by Nelder and Wedderburn (1972)  . Considering the DF incidences (Y) a response random variable or mean dependent variable, and the
Figure 1. Geographical and political map of the NEB region. Cartographic base: IBGE  .
temporal (months and years of the time series data) and climatic variables ( ) a set of predictors or independent/explanatory variables, a regression procedure can be viewed as a method for estimating the expected value of Y given the values of . The standard linear regression model assumes a linear form for the dependency, according Hair Jr. et al. (2005)  , described as:
where and . Given a sample, estimates of are usually obtained by the least squares method.
According Hastie and Tibshirani (1990)  , GAM consist of a random component, an additive component, and a link function relating that two components, like generalized linear models (GLM). The response Y, the random component, is assumed to have exponential family density:
where is called the natural parameter and is the scale parameter. The conditional mean μ of the response variable x to the linear predictor is related to the set of covariates by a link function g. The quantity:
defines the additive component, where are smooth functions, and the relationship between the conditional mean and the linear predict is defined by . The most commonly used link function is the canonical link, for which . Assuming that μ(x), is the mean of the Poisson distribution, the dependence of μ(x) and independent variables , the link function for the Poisson model is the log function . According Hastie and Tibshirani (1986, 1990)   , the generalized additive model (GAM) fits a response variable Y by a sum of smooth functions of the explanatory variables, for i = 1, ..., p by modeling the dependency as:
where are smooth functions, and .
In order to be estimable, the smooth functions have to satisfy standardized conditions such as . GAM extends the parametric form of predictors in the linear model to nonparametric forms. Assuming that Y is normally distributed, an additive model is defined as
GAM and GLM can be applied in similar situations, but they serve different analytic purposes. GLM emphasizes estimation and inference for the parameters of the model, while GAM focus on non-parametric data, and this is more suitable for exploring the data and visualizing the relationship between dependent and independent variables, considering the estimation of the smoothing terms in GAM, described in Equation (6)  .
The spatial distribution was modeled using a bi-dimensional smooth function. A smoother is a tool for summarizing the trend of a response measurement Y as a function of one or more predictor measurements . An important property of a smoother is its nonparametric nature. It does not assume a rigid form for the dependence of Y on , producing an estimate of the trend that is less variable than Y itself, since of penalized least squares method. Each smoother is controlled by a single smoothing parameter, specificity in the model or choose it automatically by the generalized cross validation method    . The GAMs used in this work included a set of directly observed covariates and an s spline smothing function, as depicted in the equations below:
where is the response variable, in this work the Dengue incidence simulated index, ’s are the slope coefficients of the model, so is the adjusted odds ratio, are the climatic variables at the individual and household levels as factor of the monthly lags in 0 - 12 times; and are s spline smooth function, and are the residuals. All covariates with a p-value ≤ 0.001 in the climatic variable univariate analysis were considered with high significance in the model.
2.2. Chi-Squared Statistic
According Zuur et al. (2007)  , this test is used for comparing models in GLM and GAM to analyze if there is no overdispersion. The chi-square test is one of the most popular hypothesis tests. The Chi-squared Statistic is a measure of how similar two categorical probability distributions are to each other. If the two distributions are identical, the chi-squared statistic is 0, if the distributions are very different, some higher number will result.
2.3. Package “Visreg”
This interface was used in this work for visualize the fit of regression models arising from of GAM, as from constructing surface by cross-sectional plots using two-dimensional contour or perspective plots. In addition to estimates of this relationship, the package also provides pointwise confidence bands and partial residuals to allow assessment of variability as well as outliers and other deviations from modeling assumptions   . The contourlines with high relative risk of DF incidence (Dengue RR) presented in the “visreg” plot were identified on the maps and their climatic limits observed in the model parameterization were considered, as areas with high occurrences of dengue rates.
2.4. Pearson’s Correlation Coefficient
Pearson correlation coefficient (r)    was used for measuring direction and degree of linear association between dengue and climatic variables, by each capital of the Brazilian Northest. According to Bewick et al. (2003)  , r can be given by:
where Pearson correlation coefficient or product moment correlation coefficient (r) is a measure of shared variance between two variables, and based in their averages and , and their standard deviations Sx and Sy. The sign indicates a positive or negative direction of the correlation, and the value suggests the power of the relationship between the variables, which value r can vary from −1 to +1, indicating a perfect and very strong positive linear relationship (r = +1), a perfect and very strong negative linear relationship (r = −1), or no linear relationship (r = 0) between the variables    .
In Table 1, it is observed the Pearson’s correlation coefficient (r) with respective
Table 1. Pearson correlation coefficient (r) with respective 95% confidence interval (CI) and p-value, between DF cases and climatic variables, by capital of the NEB.
95% confidence interval (CI) and p-value, between DF cases and 12 climatic variables, on capital of the NEB. The relative humidity presents the best correlation with DF cases for capitals analyzed, at an absolute average rate of 24.18%, with high significance (p-value < 0.001) observed in four capitals each one. Low correlation is observed with Human Comfort Index and that DF cases, at an absolute rate of 9.1%. In Teresina-PI, there are the best correlations compared to the other capitals tested, at an absolute average rate of 26.67%, and high significance (p-value < 0.001) observed to seven of 12 climatic variables in relationship DF cases. Already, in Aracaju-SE, Recife-PE and Salvador-BA, there are the lower absolute mean correlations and respective no significance p-value observed, suggesting that there are other factors involved in the increase of their DF cases.
Table 2 presents the parametric coefficients of GAM between DF incidence and relative humidity over the period of one year (0 - 12 time-lags), using temporal variables (months) in Teresina-PI and São Luis-MA cities. In Teresina-PI,
Table 2. Parametric coefficients of GAM between DF incidence and relative humidity over the period of one year (0 - 12 time-lags), using temporal variables (months) with s term splines smooth, in Teresina-PI and São Luis-MA, in the period from 2001 to 2012.
SE = standard error; z = z-value score; Pr(>|z|) = significance score Z; Sig = significance level: considering “***” when z-value is ≤0.001 (result is “highly significant” with 99.9% of the hypothesis tested being true; that is, the probability (Pr) of the error was less than 0.1%); “**” ≤0.01 (99% of the hypothesis tested is true); and “*” ≤0.1 (9% of the hypothesis tested is true).
GAM shows high significant (p-value < 0.001) association between DF incidence and relative humidity over a range of time-lags 0 - 2, 4 - 5 and 9, being the lag 2 the most significant, with the largest z-value (z = 6.802). Already, in São Luis-MA, the simulated GAM presents high significant level (p-value < 0.001) association between DF incidence and relative humidity over a range of time-lags 0 - 3 and 11, being the lag 1 the most significant, with the largest z-value (z = 7.993).
Table 3 shows the adjust coefficients for GAMs simulated in lags 0 and 1 with DF Cases (using logarithmic function of the population) and DF Incidences, assuming a Poisson distribution, in Teresina-PI and São Luis-MA, in the period from 2001 to 2012. The largest effective degree freedom (edf) values in DF cases simulations indicate nonlinear data when compared to DF incidences. Already, high values of the mean square error (Chi.sq), also simulated with those cases, characterize the overdispersion data. Although the models with DF cases have better fit of the explained deviance; however, your BIAS are extremely high, making the models with DF incidences more parsimonious and therefore more suitable for use  . In Teresina-PI, the modeling by GAM with relative humidity over a time-lag 2 explain 82.3% of the deviance on DF incidences while São Luis-MA over a time-lag 1 explain 78.0% of the deviance on DF incidences, with significant effects in the adjust coefficients with low effective degree freedom, respectively, 6.067 and 7.276; and low estimate of the intercept and respective z-value, making it the best simulated model. In the lag 0 (no lag effect), both models presented the best estimate and z-value, although they had the lowest R-adjusted between the variables measured, 0.699 in both.
Figure 2 shows the distribution of DF incidence and relative humidity as
Table 3. Parametric coefficients of GAM between DF and relative humidity, for time-lags (0 - 1 lags in São Luis-MA and 0 - 2 lags in Teresina-PI) using temporal variables (months) with s term splines smooth, simulated with DF cases (with logarithmic function of population) in the period of 2001 and 2012.
R-sq = R square adjusted; DE = explained deviance; edf = effective degree freedom, chi.sq = quadratic mean error; SE = standard error; z = z-value score.
Figure 2. Seasonality effect, by boxplot, on DF incidence (DF Cases/100.000 inhabitants) in function of months (A and E) and years (B and F); and relative humidity (%) in function of months (C and G) and years (D and H), in Teresina-PI and São Luis-MA, respectively, from January 2001 to December 2012.
time function, in Teresina-PI and São Luis-MA. The seasonal trend of that incidence over monthly and annual time-frequency was observed.
In relation to Teresina-PI, Figure 2(A) shows an increase in the DF incidence from the month of January with peak and highest mean in May, that coincides with the lagged relative humidity in 2 months or March (Figure 2(C)); declining from this month with lower rates in December, that coincides with the lagged relative humidity in 2 months or October (Figure 2(C)). It is also observed three annual periods for the occurrence of the DF epidemiological cycles, with peak in 2010-2011, 2007 and 2001-2003, according Figure 2(B).
In relation to São Luis-MA, Figure 2(E) shows an increase in the DF incidence from the month of January with peak and highest mean in May that coincides with the lagged relative humidity in 1 month, according to Figure 2(G), declining from this month with lower rates in November and December. The highest occurrence and average of DF incidences were recorded in the years 2011, 2007, 2010 and 2005, in this descending order, Figure 2(F).
Figure 3 shows the visualization of the effect of simultaneous variance between relative humidity and time (months and years, Figure 3(A) and Figure 3(B), respectively), in relationship to DF incidence risk (Dengue RR), simulated on DF incidences, from January 2000 to December 2012, by “visreg” function on simulated regression GAM using penalized s splines smoothing, in São Luís-MA. Figure 3(A) shows a large nucleus limited to between 87.0% and 90.0% relative humidity between August and October months, with high relative risk Dengue (RR = 5.0), that is, high DF incidences. Comparing this figure to contour in function of the years, Figure 3(B), 3 nuclei are observed, characterizing the years of highest DF incidences, being 83.0% and 90.0% of relative humidity range highly significant to occurrences those incidences (RR ≥ 4.0).
Figure 3. Visualization of the relationship between relative humidity on lag 1 and temporal variables (months and years), as response to Dengue Relative Risk or Dengue RR (simulated on DF incidences) by GAM regression with Poisson distribution using penalized s splines smoothing, in São Luis-MA, from January 2000 to December 2012. The legend presents in a gradual degree of Dengue RR, which ranges from −1 to 5 in Figure 3(B), with 5 being the positive chance of having incidence of dengue in the studied population increased by 5x, while −1 would be the possibility of RR in −1x.
All numerical output of GAM and respective intercepts in the capitals of the NEB have obtained setting on p-value of 0.001. The capital of João Pessoa-PB is the one with the smaller values of mean squared error (Chi.sq); in other words, the values of the estimated parameters of climatic variables around the true value of DF cases present a greater accuracy and precision in the quality of response by GAM. According Bolker (2008)  , the mean square error represented by the sum of the variance and bias square indicates the quality of an estimator and shows the total change around a true value, in this study, the DF cases.
We found a high correlation of DF incidence with relative humidity is lagged in 1 and 2 months, respectively, in São Luis and Teresina cities. Wu et al. (2007)  observed cross-correlation with statistical significance between DF incidence and relative humidity over a range of time-lags from −1 to −3 months, above all the most dominant effect at a lag −2 months (r = 0.202, p < 0.005).
Ehelepola and Ariyaratne (2016)  evidenced in their study a median increase in 7x of dengue incidences, for a relative humidity of 86%, according to figure 6 of that study. Neto and Rebêlo (2004)  , studying the association between dengue cases and climatic variables in São Luis-MA, from 1997 to 2002, verified that dengue cases are directly related to the increase of precipitation and relative humidity, with a positive correlation this variable of 76.0% (r = 0.76; p < 0.05). In addition, the authors identified peaks of relative humidity in the months of March and April, an average variation from 85.6% (March 1999) to 89.3% (April 1997), while the highest percentage of dengue was presented in May, with an average percentage of 20.2% of cases recorded. While in this study, we identify an r-adjusted between these variables of 0.75 for a −1 month lag of relative humidity in relation to dengue.
The formulation of GAM model is nearly exactly the same as for GLM. These models use all the same families and link functions; but GAM is wrapping the predictors in a non-parametric smoother function, in this paper, specifically, the s spline. The GAM fit is more sensitive to minimizing deviance (higher wiggliness) than the default fit of the loess function. This model is also able to minimize deviance based on the logit transformation. The model output shows that an overall (parametric) intercept is fitted (the mean) on the scale of the logit transformation (logarithmic population of the capitals studied).
Modeling by GAM, assuming a Poisson distribution, explained 82.3% of the deviance of DF incidences, and significant effects were found in the estimates of all climatic variables on dengue; however, the high values of the effective degrees of freedom (edf) of smooth functions indicate that the association between dengue and climate is highly nonlinear. The estimate initially found, by the GLM and GEEGLM models for these studied variables, was too high, indicating the overdispersion data, however regressions by GAM reduced significantly excess dispersion presented in the proportion of deviations from the response shown in simulations by GLM and GEEGLM, i.e., not shown here. Our results were robust to other model specifications with different controls for long-term and seasonal trends. It is suggested that the models proposed in this paper are used by surveillance agencies for planning, prevention and control of Dengue Incidence.
From 12 climatic variables, it was verified that the relative humidity was the one that obtained the highest correlation to dengue in six of nine capitals of the NEB, with high significance (p < 0.001) in Teresina-PI, Fortaleza-CE, Natal-RN and Maceió-AL. Afterwards, GAM associated with visreg was applied to understand the effects between them. March and April months show the sensibility of the use of GAM for the analysis of that correlation. Relative humidity explains the dengue at an adjusted rate of 78.0% (in São Luis-MA) and 82.3% (in Teresina-PI) delayed in, respectively, −1 and −2 months.
3URL: < http://www.sidra.ibge.gov.br/bda/popul/ > .