When the sampling methodology is complex, careful statistical modeling is needed to extract the most reliable information from the data through the model and its parameters. The goal of this manuscript is to apply item response theory (IRT) to survey data and to compare the output with a logistic regression model, a classical test theory (CTT) approach, as a point of reference.
The sampling methodology used to collect the data has a two-stage design, with primary sampling unit (PSU) strata from 15 counties and secondary sampling units (SSU) from 136 road segments within the counties, under National Highway Traffic Safety Administration (NHTSA) guidelines  . If sampling weights are ignored, then the model parameter estimates can be biased  . In fact, since the sample is collected from a two-stage stratified sampling design, standard underlying assumptions of parametric statistical models may be violated, and guidelines based on the statistical design cannot be ignored.   and  have given suggestions for such complex methodologies. Other authors have applied the methodology to studies. Our intent is to apply the seat belt sampling methodology to predict seatbelt usage.   and  have used such methodologies and they concluded that females are more likely to wear seatbelts than males. The relationship between vehicle type and seatbelt use has been explored by   and  who concluded that seatbelt use in pickup trucks is lower than in other passenger vehicles.  suggested that passenger and driver use are related.  asserts that seatbelt use is higher in those states within the United States that have primary seatbelt enforcement laws and actively enforce seatbelt use. Studies have also explored relationships between race, socio-economic status, age, rural/urban environments, law enforcement type (primary, secondary), the amount of fines, and the type of road traveled (primary, secondary, tertiary).  employed a multivariate approach using the aforementioned factors along with cultural variables to explain the differences in seatbelt use between states using self-reported information, direct observation, and crash reports. However, the validity of self-reported seatbelt use in surveys is questionable compared to observed seatbelt usage  .
While the methodology is simple to describe, the challenge lies in the statistical analysis tool used to make predictions, especially in the presence of behavioral variables, such as driver gender, vehicle type, traffic volume, road segment length, weather conditions, driver cellphone use, passenger presence, lane, and passenger seatbelt use. The goal is to obtain meaningful information that can be translated into quantitative measures.  and  propose the addition of a score variable to address this measurement concern. Those researchers have incorporated latent traits of the data in a score function.
The manuscript compares the popular logistic regression model with the item response theory (IRT) model and its simplest version, the Rasch model  .
Moreover, ignoring weights may lead to imperfection in the sample (as it departs from the reference population) and serious bias in latent variable models  . To avoid that problem, we apply a weight function.  cautioned about the use of other factors to develop more effective countermeasures for increasing seatbelt use. We propose the weighted logistic and IRT models after variable selection and compare the findings. The manuscript is organized as follows. In Section 2, we present the background of the data, then build the reference model in Section 3. In Section 4, the weights are built into the models and the IRT model is presented. We end with a conclusion in Section 5.
2. Overview of Data
Data collected in the summers of 2012, 2013, 2014, 2015, and 2016 for Virginia seat belt use are used as evidence. As mentioned in the previous section, the data were collected under a two-stage design. Primary sampling units (PSU) are county aggregates and were stratified using the five-year average annual VMT (vehicle miles traveled) in millions. Out of 97 total county aggregates, 57 account for 87.2 percent of passenger vehicle crash related fatalities. The 57 eligible county aggregates were grouped by VMT into three strata: low, medium, and high. Within each stratum, five PSUs were selected with probability proportional to size (PPS), where the measure of size (MOS) was the five-year average annual VMT. The PSU sampling weights are calculated by taking the inverse of the five-year average annual VMT, and varied from approximately 0.089 to approximately 0.967. Secondary sampling units (SSU) are road segments. Road segments were stratified by type (primary, secondary, and local) and by segment length (short, medium, and long) within each county. The eligible SSUs were then selected by PPS with segment length as the MOS, resulting in 136 selected road sites for observation. The SSU weights are calculated by taking the inverse of the segment length and varied from approximately 0.0001 to approximately 0.1657.
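As an illustration of the two-stage design described above, the following minimal sketch computes PPS selection probabilities and the usual inverse-probability design weight. The function names, the VMT figures, and the probability values are hypothetical, and the construction shown is the textbook one, which may differ in detail from the weights used for the Virginia survey. It assumes that no unit is large enough for its selection probability to exceed 1.

```python
def pps_selection_probabilities(sizes, n_selected):
    """Approximate selection probabilities under PPS sampling.

    Each unit's probability is proportional to its measure of size (MOS).
    Assumes n_selected * size / total never exceeds 1.
    """
    total = sum(sizes)
    return [n_selected * s / total for s in sizes]


def design_weight(psu_prob, ssu_prob):
    """Inverse of the overall selection probability of a road segment."""
    return 1.0 / (psu_prob * ssu_prob)


# Hypothetical five-year average annual VMT (millions) for one stratum.
vmt = [1200.0, 800.0, 500.0, 300.0, 200.0, 150.0, 100.0]
psu_probs = pps_selection_probabilities(vmt, n_selected=2)

# Hypothetical second-stage probability for one road segment.
weight = design_weight(psu_probs[0], ssu_prob=0.1)
```

The selection probabilities within a stratum sum to the number of units drawn, which is a quick sanity check on any PPS calculation.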
The weighting was added so that information from the whole population would be captured. If the selection mechanism is not informative, the parameter estimates will remain consistent regardless of the weights, and weights should be excluded from the model  . Moreover, if the strata sample sizes are large enough, the parameter estimates are unbiased. In sampling surveys, it is not always possible to determine whether the weights are informative. However, the observations should reflect the sampling weights to avoid biased sampling.
The data collected includes the following observed binary data: driver seat belt use (yes, no), driver gender (female, male), passenger present (yes, no), passenger seatbelt use (yes, no), and visible driver cellphone use (yes, no). The other observed data is categorical: vehicle type (car, truck, SUV, van, or minivan), lane of the road (1 - 5, where lane 1 represents the lane furthest to the right and lane 5 denotes the fifth lane from the right in the direction of travel), and weather (sunny/clear, light rain, cloudy, fog, or clear but wet conditions). The VMT for each site observed is classified (Road Class) within each county aggregate as lower, average, and upper. Vehicle type was assigned in no particular order, and later we reclassified it to describe the size of the vehicle which crudely correlates to seatbelt use. Weather is also not ordered in its assignment, and we reclassify it based on severity and impediment of driving ability. The data set also includes the following continuous variables: VMT, road segment length, and selection probabilities determined in the sampling design stage.
3. Unweighted Analysis and Results
Generalized linear models are usually considered in the investigation of the data. First, a classic linear model was suggested to obtain a general relationship between the response (driver seatbelt use) and predictive variables. However, use of a linear model on binary responses is not recommended  , since predicted values may be outside of the domain of the response variable. From this point forward, a classic model also known as classical test theory (CTT) is considered. We consider first fitting a logistic model to the data.
3.1. Logistic Model
In this model, p = P(Y = 1) is the probability that the driver is wearing a seat belt, and 1 − p = P(Y = 0) is the probability that the driver is not wearing a seatbelt. The initial model is:
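Model 1: $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_v X_v + \beta_r X_r + \beta_g X_g + \beta_s X_s + \beta_l X_l + \beta_c X_c + \beta_w X_w + \beta_{pp} X_{pp} + \beta_{ps} X_{ps}$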
where $\beta_0$ denotes the intercept of the model, $X_v$ denotes Vehicle Type (car, truck, SUV, van, or mini-van), $X_r$ denotes Road Classification for VMT (low, average, high), $X_g$ denotes Driver Gender (male/female), $X_s$ denotes the road segment length in miles, $X_l$ denotes Lane in which the vehicle was observed (right to left), $X_c$ denotes Driver Cell Phone Use (yes/no), $X_w$ denotes Weather (clear, light rain, cloudy, foggy, or clear but wet), $X_{pp}$ denotes Passenger Present (yes/no), and $X_{ps}$ denotes Passenger Seatbelt Use (yes/no). This notation is used consistently throughout this manuscript. The weights $w_{ij}$ are obtained as
$$w_{ij} = \frac{1}{\pi_i \, \pi_{j \mid i}},$$
where $\pi_i$ is the selection probability of the $i$th selected county, and $\pi_{j \mid i}$ is the selection probability of the $j$th road type selected within the $i$th county.
The estimated non-weighted seat belt use for each year is and
To simplify the model, the logistic fit is processed with stepwise selection at a 0.15 significance level for both entry into the model and retention in the model. The results are verified using forward selection and backward selection options. The three procedures produce the same results.
Analysis of the effects of weather on seatbelt use revealed inconsistent associations between seatbelt use and weather severity for the five years. Further, the selection process does not identify weather as significant for any combined data. Hence, weather has been removed from the model and the analysis repeated. Analysis of the predictor variables reveals a high correlation (Spearman’s correlation coefficient, ) between road segment length and road class which indicates a confounding condition. Other correlations are less than 0.15 and do not indicate the presence of other confounding effects. As a result, road segment length was removed from the model and the analysis performed again.
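The confounding check described above relies on Spearman's rank correlation. The sketch below is a small self-contained implementation (Pearson correlation applied to average ranks); the segment lengths and road class codes are made-up values for illustration only, not the survey data.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical road segment lengths (miles) and road class codes.
segment_length = [0.2, 0.5, 0.9, 1.4, 2.0, 2.6]
road_class = [1, 1, 2, 2, 3, 3]
rho = spearman(segment_length, road_class)
```

A value of `rho` close to 1 for two predictors, as in this toy example, is the kind of signal that motivates dropping one of them from the model.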
Table 1 provides the Wald Test for significance in the selected Model with variables as Vehicle type, Road class, driver gender, and so on. For 2012-2013 combined data, all remaining predictors are significant at p = 0.01, while passenger presence is removed due to a p-value > 0.15. For 2012-2014, all predictors are significant at p = 0.05. For the combined 2012-2015 data, predictor variables have p-values < 0.005. For the combined data for 2012 through 2016, all five of the remaining predictors are significant at p < 0.005.
The close agreement between the models may indicate that the aggregate data follow a standard model which also fits the individual data sets. The test of the global null hypothesis, shown in Table 2, of $H_0$: all $\beta_i = 0$ versus $H_a$: at least one $\beta_i \neq 0$ (with the set of coefficients depending upon the model) indicates that significant evidence exists (p < 0.0001) to support the claim that the models are not explained solely by the intercept (i.e., the response is not a constant) for all four presented models, which is consistent with the Wald test results in Table 1.
Goodness of fit is measured by Akaike Information Criterion (AIC) numbers  , displayed in Table 3: smaller numbers indicate a better fit. AIC is defined as follows:
$$\mathrm{AIC} = N \ln\left(\frac{SS_r}{N}\right) + 2p,$$
where $p$ is the number of parameters in the model, $SS_r$ is the residual sum of squares, and $N$ is the number of observations in the dataset.
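For concreteness, the sketch below evaluates AIC in two common forms: the least-squares form built from the residual sum of squares, the number of observations, and the number of parameters, and the general likelihood form, $-2 \log L + 2p$, which is the form typically reported for logistic regression fits. The function names and input values are illustrative assumptions.

```python
import math


def aic_least_squares(ss_r, n_obs, n_params):
    """AIC from the residual sum of squares: N * ln(SSr / N) + 2p."""
    return n_obs * math.log(ss_r / n_obs) + 2 * n_params


def aic_from_log_likelihood(log_lik, n_params):
    """General form of AIC: -2 * log-likelihood + 2p."""
    return -2.0 * log_lik + 2 * n_params


# Illustrative values only.
aic_ls = aic_least_squares(ss_r=100.0, n_obs=100, n_params=2)
aic_ll = aic_from_log_likelihood(log_lik=-100.0, n_params=3)
```

In both forms the penalty term grows with the number of parameters, so a predictor lowers AIC only if it improves the fit by more than the penalty it adds.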
Table 1. Type 3 analysis of effects.
Table 2. Testing global null hypothesis: β = 0.
The AIC values for the logistic regression performed on the significant variables identified during the selection process are in the tens of thousands. Since the intercept alone is not a sufficient explanation of the model, we use the values for intercept and covariates. The AIC numbers obtained for individual years are approximately 30% lower than those obtained by  ; however, the values for the combined data are considerably higher. The higher numbers for the combined data indicate a substantial amount of variation in the model, or a less than optimal fit.
3.2. Variable Standardization and Reclassification
Since vehicle types are listed in no particular order, vehicle type is reclassified to indicate the size of the vehicle, which negatively correlates with driver seatbelt use: in general, the drivers of larger vehicles tend to wear seatbelts less often than drivers of smaller vehicles, as suggested in  . Preliminary analysis of the data appears to support this hypothesis, so smaller vehicle types are given a larger value to indicate that the driver is more likely to wear a seatbelt. Table 4 contains the reclassifications of vehicle type. The remaining five predictor variables have positive correlations with driver seatbelt use and reclassification is not necessary. It is known that the variance is larger for population parameters with large values than for population parameters with smaller values. In order to make the variance between variables more homogeneous and reduce the overall model variance, each variable of interest was standardized by dividing its value by its third quartile (Q3) in an approach similar to  . Standardizing the variables may affect whether they are selected in the model, so all six of the potential predictors are standardized. The Q3 values of the variables after reclassification are listed in Table 5. Note that the Q3 values are the same for all five years, and thus the combined Q3 values are constant across time.
Table 3. Model fit statistics.
Table 4. Reclassification of variables.
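The Q3 standardization described in this section can be sketched as follows. The integer size codes (1 for SUV/van/truck, 2 for minivan, 3 for car) and the sample values are assumptions made for illustration; the quartile is computed with the "inclusive" method of Python's `statistics.quantiles`, which may not match the quantile definition used by the authors' software.

```python
import statistics


def third_quartile(values):
    """Third quartile (75th percentile) using the inclusive method."""
    return statistics.quantiles(values, n=4, method="inclusive")[2]


def standardize_by_q3(values):
    """Divide every value by the third quartile of the variable."""
    q3 = third_quartile(values)
    return [v / q3 for v in values]


# Hypothetical reclassified vehicle sizes: larger code = smaller vehicle.
vehicle_size_codes = [1, 2, 3, 3, 3, 1, 3, 2]
standardized = standardize_by_q3(vehicle_size_codes)
```

With a third quartile of 3, the codes map to 1/3, 2/3, and 1, so the standardized variable lies between 0 and 1 like the binary predictors.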
3.3. Model Fitting after Standardized and Reclassified Variables
The logistic selection process with p = 0.15 for entry and retention in the model is performed on the reclassified and standardized variables. The significant variables indicated prior to standardization in 3.2 above remain significant (Table 6). The model fit statistics are comparable to the previous analysis (Table 7). The global null hypothesis test indicates that the model is not sufficiently described solely by the intercept (Table 8). All variables selected are significant (p-value < 0.0001) for all datasets analyzed. In this analysis, it is reasonable to select the model fit by the combined 2012-2016 data:
Model 2: $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_v X_v + \beta_r X_r + \beta_g X_g + \beta_l X_l + \beta_c X_c + \beta_{pp} X_{pp}$.
Table 5. Third quartiles after reclassification (No weight).
Table 6. Type 3 analysis of effects for standardized and reclassified variables.
Table 7. Model fit statistics for standardized and reclassified variables.
Table 8. Global null hypothesis: β = 0 for Standardized and reclassified variables.
The variable significance is displayed in Table 6, and the fit estimates are shown in Table 7. The AIC and SC numbers remain undesirably large (Table 7) and indicate that reclassification and standardization are not sufficient actions to improve model fit. Therefore, we investigate the cause of the poor model fit.
In the previous sections, the AIC, BIC, and log likelihood have been used as measures of goodness of fit for the most parsimonious models. Their values turn out to be high, which is evidence of over-dispersion: there is more variability in the data than expected from the fitted model, indicating a poor fit. Since the sample size is large, the corrected AIC does not lead to improvements. Variables have been selected for each dataset and the selection process results in similar models. We will use these criteria for comparison when adding the weights to the models considered in the next section.
4. Weighted Statistical Models
In all of the above analyses, the weights associated with the data were ignored. However, driver seat belt behavior is intricate and quite certainly involves non-collected data. Ignoring sample weights leads to inflated standard errors and biased estimates  .  provide guidelines for data analysis under weighted and designed data which reduce the bias that would result from oversampled strata. The weights are stratum size and length of road segments. The inclusion of weights results in a significantly different model than the one selected in Section 3 above, as inferred by  . Additionally, the goodness of fit criteria are significantly reduced (improved). The sampling plan for the data in this manuscript was developed as a joint effort between two of the authors (N. Diawara and B.E. Porter) and NHTSA. Therefore, in order to correct for bias due to stratum size and length of road segment, we included the weight designed for this analysis in our model, in accordance with NHTSA requirements  as:
In this section, we will compare the results of the analysis based on the sampling weights and validate the appropriateness of the use of the weights.
4.2. Weighted Logistic Models
4.2.1. Model Fitting: Weighted Logistic Regression
Prior to performing analysis on the reclassified and standardized variables, the 75th percentiles for the weighted reclassified variables are determined. The weighted third quartile values are the same as the unweighted values listed in Table 5.
The selection process using the weighted logistic regression model and the SAS® logistic procedure resulted in three significant predictors at p = 0.15: driver gender, passenger presence, and vehicle type for 2012-2013 data. The selection process for both the 2012-2014 data and the 2012-2015 data additionally indicates that cell phone use is significant at p = 0.10. In the aggregate data for 2012-2016, the selection process results in three significant variables at p = 0.05 (see Table 9). There appears to be an increasing significance in the prediction of driver seat belt use by cell phone use (p > 0.15 to p ≈ 0.05) over time. The model is significant as indicated by the global null hypothesis test in Table 10.
There is a significant decrease in the AIC when the weights are added to the model, consistent with the observation in  that, in the context of behavioral ecology, a simple controlled model does not capture all the complexity of the data. Table 11 contains the AIC and SC values, which are lower than those of the corresponding unweighted models by a factor of approximately 20. The weights have improved the accuracy of the model, as they help reduce the residual variance.
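To make the idea of a weighted logistic fit concrete, the following sketch maximizes a weighted log-likelihood for a model with an intercept and a single predictor using Newton-Raphson iterations. It is a simplified stand-in, written under our own assumptions, for what a statistical package does when observation weights are supplied; the data and weights are toy values, and the function name is hypothetical.

```python
import math


def fit_weighted_logistic(x, y, w, n_iter=25):
    """Fit logit(p) = b0 + b1 * x by weighted maximum likelihood.

    Uses Newton-Raphson on the weighted log-likelihood
    sum_i w_i * [y_i * log(p_i) + (1 - y_i) * log(1 - p_i)].
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0            # weighted score (gradient) components
        h00 = h01 = h11 = 0.0    # weighted information matrix entries
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = wi * (yi - p)
            v = wi * p * (1.0 - p)
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * delta = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy data: one binary predictor, binary response, unequal weights.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
w = [1.0, 1.0, 1.0, 3.0]
b0, b1 = fit_weighted_logistic(x, y, w)
```

In this toy example the weighted proportion of positive responses is 1/2 when x = 0 and 3/4 when x = 1, so the fitted intercept is 0 and the fitted slope is log(3); the weight of 3 on the last observation is what moves the estimate away from the unweighted fit, whose slope would be 0.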
Figure 1 displays the predicted probability of seat belt use (for drivers using a cellphone with a passenger present) versus the vehicle type for each gender. The same general upward trend exists in the weighted model as in the unweighted model, but with fewer predictors. Please note that the authors have only included one chart for this model due to the excessive space required to depict all 24 such combinations.
Table 9. Type 3 analysis of effects for weighted, standardized and reclassified variables.
Table 10. Global null hypothesis: β = 0 for weighted, standardized, and reclassified variables.
Table 11. Model fit statistics for weighted, standardized and reclassified variables.
Figure 1. Model 3: Multivariate weighted logistic regression on model with p = 0.15 selection (2012-2016 Data).
4.2.2. Model Selection: Weighted Logistic Regression
The final model selected for the 2012-2016 aggregate data is
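Model 3: $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_v X_v + \beta_g X_g + \beta_c X_c + \beta_{pp} X_{pp}$
where $\beta_0$, $\beta_v$, $\beta_g$, $\beta_c$, and $\beta_{pp}$ are the estimates calculated using the weights.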
As expected, the combination of the data results in an improvement in the significance of the predictors compared to individual models. However, the models have different selected variables and one of the variables selected for the 2012-2016 combined data has a p-value > 0.05 indicating the necessity for a different analytical method.
One suggestion is to develop an IRT model for prediction of seatbelt use, and it is advisable to include only very significant predictor variables (p ≤ 0.05). All four selected variables in the aggregate 2012-2016 data model have significance levels less than or very close to 0.05. We explore an IRT model using a selection process with p = 0.05 significance on the combined data.
4.3. Weighted Item Response Theory Model
To analyze dichotomous events or polytomous response data (as usually found in the quality of life field), the item response theory (IRT) model provides a complement to classical test theory (CTT), as the behavior and characteristics of the driver are not directly observable. Direct measurement of driver behavior is difficult, since it is based on qualitative indicators, such as the type of vehicle used, and other ad hoc parameters that are not easy to translate into quantitative information for use in a CTT statistical analysis. Because of that, IRT and its well-known Rasch model have also been implemented to measure drivers’ behaviors. The IRT model allows the inclusion of a latent factor common to all drivers that can be described by a score function. We apply such a model based on specified traits that reflect the dichotomy of the data, such as gender, and make comparisons. We then compare the efficiency and effectiveness of the overall indicators by computing goodness of fit statistics.
Because the model requires consideration of several conditions, the Rasch model is considered, as it provides a tool to analyze characteristics even when they are latent. Such a model can be included in the class IRT in the framework proposed by  . Driving habits can be seen as a variable which depends on many factors. Our primary focus is on seat belt use and indicators which give additional information to evaluate seat belt use. We propose to extend the theory of logistic regression to include characteristics associated with driver seatbelt use which is translated into the driver’s condition as an associated score. In such a context, the Rasch model (   ) is an option where we can include each driver’s behavior regarding seat belt use. One main concern is the associated measurement of the score. That score is based on the qualitative information to be translated into quantitative measure. Using ideas from  , we develop a score function that can be used to build the sensitive attributes and behaviors of drivers. As mentioned in  , the bias reduction is achieved through appropriate weight adjustments.
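For reference, the standard dichotomous Rasch model can be written as below. The symbols $\theta_n$ (the latent trait of driver $n$) and $b_i$ (the difficulty of item $i$) follow conventional IRT notation and are assumptions introduced here for illustration; they are not taken from the manuscript.

```latex
P(X_{ni} = 1 \mid \theta_n, b_i)
  = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)},
\qquad
\log\frac{P(X_{ni} = 1)}{1 - P(X_{ni} = 1)} = \theta_n - b_i .
```

The second expression shows that the Rasch model is a logistic model on the log-odds scale, which is what makes a logistic regression on a summary score a natural point of comparison.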
A score function is built using a linear combination of significant predictor variables. The proposed score attempts to capture the features of vehicle type driven, driver gender, passenger presence, and driver cellphone use. Those features can alter the probability of seat belt use and they can be seen as sufficient statistics for the response (See  ). In our case, due to the logistic analysis on driver seat belt use, we propose to use a score function composed of driver gender, vehicle type, passenger presence, and handheld cellphone use as follows:
$S = X_g + X_v + X_{pp} + X_c$
where $X_g$ = driver gender (male = 0 and female = 1), $X_v$ = size of vehicle driven standardized by the 3rd quartile (1/3 = SUV/Van/Truck, 2/3 = Minivan, and 1 = car), $X_{pp}$ = passenger presence (present = 1 and not present = 0), and $X_c$ = driver cellphone use (no = 0 and yes = 1).
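The score function can be computed directly from the coded variables. The sketch below uses the codings given above; the function name, argument names, and dictionary keys are hypothetical choices made for illustration.

```python
# Standardized vehicle size codes as defined in the text.
VEHICLE_SIZE = {
    "suv": 1 / 3,
    "van": 1 / 3,
    "truck": 1 / 3,
    "minivan": 2 / 3,
    "car": 1.0,
}


def driver_score(female, vehicle, passenger_present, cellphone_use):
    """Score S = gender + vehicle size + passenger presence + cellphone use."""
    return (
        int(female)
        + VEHICLE_SIZE[vehicle]
        + int(passenger_present)
        + int(cellphone_use)
    )


# Highest and lowest possible scores under this coding.
highest = driver_score(True, "car", True, True)
lowest = driver_score(False, "truck", False, False)
```

With three distinct vehicle size values and three binary indicators, there are 3 x 2 x 2 x 2 = 24 covariate patterns, and the score ranges from 1/3 to 4.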
The final model is
$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_S S.$$
The logistic regression analysis yields parameter estimates (standard error) (0.1384) and (0.0609) for the 2012-2016 combined data (Table 12).
The AIC values (Table 13) are comparable to the AIC values in the weighted logistic analysis shown in 4.2.1 indicating a satisfactory fit of the model. The model is significant as indicated by the global null hypothesis test given in Table 14. The odds ratio estimate and its confidence interval are provided in Table 15. Figure 2 shows the regression line and 95% confidence limits for predicted probability of seatbelt use versus the weighted score function. The narrow confidence band and the linear upward trend also indicate a satisfactory fit of the model to the data. All such results conform with the findings by  in the bias reductions even in the nonresponse situation, and provide an improvement on their suggested approach.
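The quantities reported for the score model, namely predicted probabilities and the odds ratio with its confidence interval, follow from the fitted coefficients in a standard way. The sketch below uses placeholder values for the intercept, slope, and standard error; they are hypothetical and are not the estimates from Table 12. The interval is the usual Wald interval on the log-odds scale.

```python
import math


def predicted_probability(score, intercept, slope):
    """Inverse logit of the linear predictor intercept + slope * score."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * score)))


def odds_ratio_with_ci(slope, std_err, z=1.96):
    """Odds ratio exp(slope) with a Wald confidence interval."""
    return (
        math.exp(slope),
        math.exp(slope - z * std_err),
        math.exp(slope + z * std_err),
    )


# Hypothetical coefficients, for illustration only.
intercept, slope, std_err = 1.0, 0.5, 0.06
p_low = predicted_probability(1 / 3, intercept, slope)
p_high = predicted_probability(4.0, intercept, slope)
odds_ratio, ci_lower, ci_upper = odds_ratio_with_ci(slope, std_err)
```

With a positive slope, the predicted probability of seatbelt use increases with the score, and the odds ratio gives the multiplicative change in the odds for each one-unit increase in the score.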
Table 12. Analysis of maximum likelihood estimates.
Table 13. Model fit statistics.
Table 14. Testing global null hypothesis: β = 0.
Table 15. Odds ratio estimates.
Figure 2. Logistic regression of seatbelt use versus weighted score.
The present IRT model offers many more advantages than the classical test theory (CTT) methods developed in Section 3. The model is parsimonious and allows driver seat belt behavior to be easily estimated from scaled psychometric item measures under a weighted design model.
5. Conclusion
Driver seatbelt use in the Commonwealth of Virginia may be satisfactorily described using driver gender, vehicle type, passenger presence, and cellphone use in a multivariate logistic model using weights designed specifically for the dataset. However, prediction of seatbelt behavior is more appropriate using item response theory. As such, we have endeavored to build a score function considering driver gender, vehicle type driven, passenger presence, and cellphone usage by applying the IRT model with weights within the model. Fitting a weighted model results in significant improvements in goodness of fit statistics, such as AIC numbers, by a factor of approximately 20.
We suggest that a weighted IRT model is more appropriate and it may also potentially include other factors. Such a model could be used to develop programs and more applications of the IRT models.
The authors are grateful to the referees and editor for their detailed suggestions, comments and insights, which improved the quality of the paper considerably. The research was made possible by financial support from the Virginia Department of Motor Vehicles via funding from the National Highway Traffic Safety Administration.