Item Response Theory (IRT) is a psychometric tool originally applied in the field of education and currently used in multiple fields that yield categorical outcome data. We use IRT models to analyze health and wellness data based on questionnaire-like variables with numerous categorical responses. While basic frequentist descriptive analysis presents statistics on variables and categories independently, an IRT model allows researchers to describe these variables in terms of their latent traits, as well as their relationship with other variables and the data set as a whole, based on individual responses. IRT modeling quantifies the latent traits with three parameters: the difficulty or threshold parameter, the discrimination parameter, and the ability parameter. IRT is not a new theory, nor is it the only tool that can be applied to analyze health assessment data. Common alternatives include classical test theory (CTT) methods, such as Cronbach’s coefficient alpha, and factor analysis. However, IRT has proved popular because of its adaptability, its effectiveness in designing and evaluating questionnaires, and its use for scoring respondents.
The foundation of IRT is the concept that there is a link between item responses and the various characteristics measured by the test. Based on this concept, IRT suggests that a set of personal latent characteristics underlies respondent performance on a set of items, and that these characteristics can be estimated from the respondents’ answers to the items and questions. From these estimates, IRT produces a generalized linear model that can be used to perform further analysis.
IRT was first developed for the field of education in order to calibrate and evaluate tests and to score students based on their ability and other latent traits. However, IRT has expanded to more fields, from psychometrics to health assessment and clinical research. Many studies have used IRT to create item banks, which consist of a collection of IRT-calibrated questions that have been shown to be the best at defining a domain within health measurement. One example is the Patient-Reported Outcomes Measurement Information System (PROMIS), which is part of the NIH Roadmap Initiative. This system applies both IRT and computerized adaptive testing (CAT) to improve the precision and efficiency of health measurement by reducing both the number of questions needed and the number of subjects surveyed.
IRT has also been used to assess already established health measurement tools. For example, a study by Hartman et al. used IRT to analyze the DSM-IV abuse and dependence criteria among 5587 children aged 11-19. Specifically, the study aimed to answer three questions: 1) do the criteria (dependence and abuse) represent two different levels of substance involvement severity? 2) to what degree does each criterion assess cannabis abuse/problems? and 3) do the criteria work similarly across different adolescent groups? Using IRT, the study concluded that dependence and abuse were not separate constructs for cannabis problems, and that the criteria needed refinement to better assess cannabis abuse and dependence. Other studies have also used IRT to refine established health measurement tools. A 1996 study analyzing measurement instruments for community-living individuals with cerebral palsy and spina bifida found that combining certain items from the Functional Independence Measure and instrumental activity measure was useful for disability assessment.
The IRT method exhibits characteristics not found in traditional approaches such as factor analysis or Cronbach’s alpha. One of the principal benefits of IRT over classical test theory methods is that it takes into account the latent and invariant traits of both the item and the respondent. For example, IRT models simultaneously measure the latent proficiency or ability of an individual subject in answering items and the difficulty of the item being answered. What makes IRT estimates useful is that the item parameters are not test dependent and the item statistics do not depend on individual ability level; rather, item statistics and ability are measured on the same scale, which allows predictions about an item or group of items for individuals or groups of individuals. IRT also takes into account the dependence of an item on sampled individuals. Together, these strengths allow results to be both more precise and more generalizable.
The IRT model is also able to detect variability in responses between groups, known as differential item functioning (DIF). This information can suggest whether a test can be applied to different sub-samples or groups. By testing for DIF, researchers can reduce bias and increase the validity of the model. IRT also allows for more flexible and precise score equating. Score equating works not only between items within a test but also between multiple scales and questionnaires, creating a kind of conversion table by which to analyze results. Because IRT can equate scores, it also helps improve existing measures. For instance, it can identify where along a latent trait scale a measurement provides little information and needs improvement.
IRT can also identify clinically significant differences or change over time. Because IRT estimates of latent traits directly affect the probability of an item response, and because items and parameters are measured on an equated, linked scale, changes over time and point estimates have clinical meaning. Thus, researchers can use IRT to determine clinically significant thresholds of change in clinical parameters.
There are multiple models which one can apply when performing IRT. Depending on the nature of the measured item outcome, such as dichotomous or ordinal, IRT provides alternatives to achieve the best fit for the data and the most representative results. For example, the one-parameter model, commonly known as the Rasch model, models dichotomous item responses as a function of the latent trait and the difficulty of the item; it allows items to vary in difficulty but assumes that all items discriminate equally (i.e., equal slopes for each item). Adding further parameters for discrimination and the impact of chance allows one to account for more variability in the data, thus increasing the validity of results.
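As a minimal sketch of how the added parameters enter a dichotomous model, the function below evaluates the logistic item response function in its one-, two-, and three-parameter forms. The function name `irf`, its argument names, and the example values are illustrative assumptions rather than part of the analysis reported here.

```python
import math

def irf(theta, b, a=1.0, c=0.0):
    """Probability of endorsing a dichotomous item.

    theta: latent trait of the respondent
    b:     item difficulty
    a:     discrimination (slope); a = 1 gives the one-parameter form
    c:     lower asymptote (impact of chance); c = 0 removes it
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

theta = 0.5  # hypothetical respondent
print(irf(theta, b=0.0))                # one-parameter (Rasch) form
print(irf(theta, b=0.0, a=2.0))         # two-parameter form
print(irf(theta, b=0.0, a=2.0, c=0.2))  # three-parameter form
```

When the latent trait equals the item difficulty, the one- and two-parameter forms both give a probability of 0.5, while a non-zero lower asymptote raises it.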
IRT also presents numerous advantages in functionality and convenience over common alternative methodologies. Firstly, IRT provides robust estimates and models. IRT also applies multiple tests and functions at once, making it a more time-efficient method for researchers. Other methods, such as factor analysis and Cronbach’s alpha, fulfill only certain functions. For instance, Cronbach’s alpha only assesses the internal consistency (reliability) of a scale, and factor analysis only allows researchers to pick important variables but does not provide a model with which to analyze data and draw inference. IRT, however, performs all of these functions. Another significant distinction between IRT models and the traditional approaches is that an IRT model allows researchers to rank individual respondents based on their answers to items, thus indicating individual risk.
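For contrast with the item-level and person-level output of IRT, Cronbach’s alpha reduces a whole scale to a single number. The sketch below uses the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores); the function name and the small data set are hypothetical.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents.

    scores: list of rows, one per respondent, each a list of k item scores.
    """
    k = len(scores[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of the respondents' total scores.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)

# Hypothetical responses of five people to three ordinal items.
data = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 3, 3],
    [4, 3, 4],
    [4, 4, 5],
]
print(round(cronbach_alpha(data), 3))
```

The result describes the scale as a whole; it does not place individual items or respondents on a common metric.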
Despite the many strengths of IRT compared with classical methods of test analysis, the theory has its own weaknesses. Some of its limitations lie in the assumptions that must be satisfied: 1) unidimensionality, 2) local independence of items, and 3) item parameter invariance. These assumptions cannot always be made with confidence. Unidimensionality and local independence can be examined using graphs such as scree plots, or using the weighted least squares mean- and variance-adjusted estimator for categorical data. However, these tests can never establish the assumptions as unarguably true; they can only show that the assumptions hold approximately. For unidimensionality, the assumption cannot be strictly true because several latent and test-taking factors always affect test performance to some extent.
Another limitation of IRT is that model selection and building are not straightforward. When choosing an IRT model, the main objectives are to find a model that fits the data, to estimate the model parameters properly, and to use the model correctly. There are multiple modeling schemes to choose from, such as the Rasch or graded response model. Careful consideration and comprehensive knowledge are needed not only to perform IRT testing, but also to interpret the results. IRT results also cannot indicate how to improve or write items, or which items could fill a noticeable gap in the item difficulty range.
Using IRT also poses a practical problem: it depends on finding a statistical program that will fit the models, and learning and implementing these programs is not easy. One needs extensive knowledge of coding in programs such as R, SAS, Stata, or Winsteps, to name but a few. Sometimes there is no direct command to perform IRT, and extensive coding is required.
Despite these limitations, IRT is an efficient and beneficial tool for analyzing not only testing data, but also questionnaire, measurement, and many other forms of data. Next we illustrate the use of IRT models using data from a technology-assisted health coaching program called m. chat.
1.1. M. Chat Program
The m. chat program is funded by a Medicaid 1115 Waiver to the State of Texas. It is geared towards permanent supportive housing residents in the city of Fort Worth, Texas, with the goal of improving key health indicators of the participants by providing in-person health coaching. Subjects in the program are adult residents of permanent supportive housing who were Medicaid-enrolled or low-income uninsured and English speaking. In addition, the subjects reported at least one of the following mental health conditions: having been prescribed a medication for an emotional or psychological problem, receiving a pension due to psychiatric disability, reporting hallucinations, or scoring greater than 9 on the Patient Health Questionnaire (PHQ-9) depression screener, indicating moderate to severe depression. Participants were surveyed on domains which comprise general health: diet, social habits, leisure practices, mental health, substance abuse, self-sufficiency and medication adherence. Overall, 90 baseline items were included in the analyses. The program has been described in further detail by Walters et al.
1.2. Item Response Theory Model
In IRT it is assumed that there is a link between the item responses and the various characteristics measured by the test. Based on this concept, IRT suggests that a set of personal latent characteristics underlies respondent performance on a set of items, and that these characteristics can be estimated from the respondents’ answers to the items and questions.
To explain the parameters and their role in IRT modeling, we focus on the two models we used: the Rasch model and the Graded Response Model. The Rasch model, or one-parameter logistic model, is applied to binary data. Compared with other IRT models, the Rasch model prioritizes simplicity over fit. The model for a binary response $X_{ij}$ is as follows:

$$P(X_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}$$
where i is an individual subject and j is a specific category within a question. The model gives the probability of a Bernoulli random variable, with θi representing the proficiency or ability of an individual subject and bj the difficulty of the specific category. In comparison, the Graded Response Model is a multi-parameter model and can accommodate responses with more than two categories.
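To make the role of the ability parameter concrete, the following sketch estimates θ for one respondent by maximum likelihood under the Rasch model, treating the item difficulties as already known. It uses Newton-Raphson updates on the log-likelihood, whose first derivative is the sum of (response − probability) and whose curvature is the sum of p(1 − p). The function names, difficulties, and responses are hypothetical, and this is not the estimation routine used in the analysis reported here.

```python
import math

def rasch_prob(theta, b):
    """Rasch probability of a positive response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(responses, difficulties, iters=50, tol=1e-10):
    """Maximum-likelihood ability estimate for one respondent.

    responses:    list of 0/1 answers, one per item
    difficulties: known item difficulties b_j, in the same order
    """
    total = sum(responses)
    if total == 0 or total == len(responses):
        # All-0 or all-1 patterns push the estimate to minus or plus infinity.
        raise ValueError("no finite estimate for all-0 or all-1 patterns")
    theta = 0.0
    for _ in range(iters):
        probs = [rasch_prob(theta, b) for b in difficulties]
        score = sum(x - p for x, p in zip(responses, probs))  # gradient
        info = sum(p * (1.0 - p) for p in probs)              # negative curvature
        step = score / info
        theta += step
        if abs(step) < tol:
            break
    return theta

difficulties = [-1.0, 0.0, 1.0]  # hypothetical item difficulties
print(estimate_ability([1, 1, 0], difficulties))
```

At the estimate, the expected number of positive responses equals the observed number, which is why only the total score matters for the ability estimate under the Rasch model.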
The Graded Response Model applies specifically to ordinal data with more than two categories and builds upon the Rasch model to calculate parameters and probabilities for question j by subject i for category level k. Whereas the Rasch model strives for simplicity, the Graded Response Model tries to fit the data using more descriptive parameters. The primary assumption of the Graded Response Model is that item discrimination and difficulty are not equal across items. The model can be written as follows:
for k = 1, …, n. Then for the last value the model finishes with
Here θi still represents the subject’s ability, and b(jk) continues to represent the difficulty parameter. However, this parameter now incorporates the step size parameter, d(jk). The parameter d(jk) gives the latent trait location where one category becomes more likely than the one before it. Finally, the Graded Response Model includes aj, the slope or discrimination parameter for each question.
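As a hedged illustration of how these parameters combine, the sketch below uses the standard Samejima form of the graded response model. In that form the probability of responding in category k or higher is a logistic function of a_j(θ_i − b_jk), and each category probability is the difference between adjacent cumulative probabilities. The function name, the zero-based category indexing, and the example values are assumptions for illustration; the parameterization used in the analysis, with a separate step size d(jk), may differ in notation.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one graded item.

    theta:      latent trait of the respondent
    a:          item discrimination (assumed positive)
    thresholds: increasing thresholds b_j1 < ... < b_j,K-1 for K categories
    Returns [P(Y = 0), ..., P(Y = K - 1)].
    """
    # Cumulative probabilities P(Y >= k), bracketed by 1 and 0.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Each category probability is a difference of adjacent cumulative terms.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# Hypothetical four-category item.
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
```

With a single threshold the function reduces to a two-parameter dichotomous model, and the category probabilities always sum to one.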
A benefit of IRT analysis is that all items are placed on the same metric. As a result, items measuring a variety of domains can be compared directly with each other. The results of our analysis are presented in Tables 1-8 for the eight domains. Table 1 shows items from a modified dietary screener questionnaire. Table 2 shows items from the Meaningful Activity Participation Assessment (MAPA). Table 3 shows items from the Interpersonal Support Evaluation List (ISEL). Table 4 shows items from the abuse section of the Addiction Severity Index. Table 5 shows items from the Quality of Life Enjoyment and Satisfaction Questionnaire (Q-LES-Q). Table 6 shows items from the Inventory of Drug Use Consequences (INDUC). Table 7 shows items from the Patient Health Questionnaire (PHQ-9). Table 8 shows items from the Morisky Medication Adherence Questionnaire (MMAQ). Most scales had been adapted from the original to fit the target population. Overall, 88 items were analyzed. These tables include the parameter estimates from the IRT analyses: the threshold, discrimination, and location parameters. The threshold parameter indicates the point on the ability spectrum at which the sample is equally likely to respond in a given category or in one of the subsequent categories. The ability spectrum is modeled using a standard normal distribution, where θ = 0 equals average ability for the sample. Large negative or positive estimates indicate lower or greater ability, respectively. Calibration places all items on the same ability metric, thus allowing comparison across items. Not all questions in a questionnaire have the same scaling and categorization, so careful interpretation is needed. The discrimination parameter indicates the ability of the item to discriminate between groupings within the sample. Finally, the location parameter equals the average threshold of the item, indicating how difficult it is for the sample to answer the question “correctly”. A lower location parameter indicates that it is easier for the sample to give the presumed “correct” answer to the behavioral questions, while a higher location parameter indicates more difficulty. The location parameter is equivalent to the item difficulty parameter in dichotomous models.
Table 1. Estimates of item parameters (category thresholds, item locations, and discrimination) for items in the DIET domain.
Table 2. Estimates of item parameters (category thresholds, item locations, and discrimination) for items in the MAPA domain.
Table 3. Estimates of item parameters for items in the ISEL domain.
Table 4. Estimates of item parameters (category thresholds, item locations, and discrimination) for items in the “abuse” domain.
Table 5. Estimates of item parameters (category thresholds, item locations, and discrimination) for items in the “QLESQ” domain.
Table 6. Estimates of item parameters (category thresholds, item locations, and discrimination) for items in the “INDUC” domain.
Table 7. Estimates of item parameters (category thresholds, item locations, and discrimination) for items in the “PHQ-9” domain.
Table 8. Estimates of item parameters (category thresholds, item locations, and discrimination) for items in the “Morisky” domain.
We included data collected from 416 participants at baseline in the analysis. The average age was 50.65 years. The sample consisted of 41.61% White, 51.77% Black/African American and 6.62% others. The average BMI was 31.41, with 23.40% in the normal category, 22.93% in the overweight category and 53.66% in the obese category. The burden of disease was also significant in the sample, with 5.7% reporting diabetes, 26.41% reporting asthma, 4.94% reporting breathing disorders and 88.54% reporting depression, anxiety or emotional disorder. More than 50% of the sample reported multiple chronic health conditions.
Of all items, the most discriminating were those under the domain “INDUC”, a series of questions about negative consequences of alcohol and drug use, with an average discrimination estimate of approximately 4.99. The three most discriminating items were: “I was unhappy because of my drinking or drug use” (Estimate: 9.33); “Drinking or drug use got in the way of my growth as a person” (Estimate: 8.89); and “I felt guilty or ashamed because of my drinking or drug use” (Estimate: 7.01). In comparison, the least discriminating items were those on diet habits, with an average estimate of −0.476. The three least discriminating items were: “How many times a week did you eat desserts and other sweets (not the low-fat kind)?” (Estimate: −1.27); “There are people I can count on in this neighborhood” (Estimate: −1.00); and “People in this neighborhood help each other out” (Estimate: −0.80). The low discrimination estimates suggest that there is a high probability that subjects at any ability level will endorse any level of categorical response. In other words, no one group of subjects is more or less likely to answer in a certain category. As a result, these questions hold very little information about the sampled individuals and their behavioral habits. In comparison, the most discriminating items can be regarded as holding the most information and are able to discriminate subjects into characteristic groupings based on their responses.
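The link between discrimination and information can be illustrated with the item information function of a two-parameter dichotomous model, I(θ) = a² P(θ)(1 − P(θ)). This is a simplified sketch, not the graded-response information function that applies to the ordinal items analyzed here, and the function name and the discrimination values are hypothetical.

```python
import math

def item_information(theta, a, b):
    """Fisher information of a two-parameter dichotomous item at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

# At theta == b the information equals a**2 / 4, so it grows with the
# square of the discrimination.
for a in (0.3, 1.0, 3.0):  # hypothetical discrimination values
    print(a, item_information(theta=0.0, a=a, b=0.0))
```

An item with a slope near zero contributes almost nothing at any point on the latent scale, while a steep item concentrates its information near its difficulty.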
The location parameter identifies the extremes of subject ability. The two items with the lowest location estimates are: “How many servings of vegetables did you eat each day?” (Estimate: −66.00) and “How many servings of fruit did you eat each day?” (Estimate: −26.35). This indicates that very few study subjects were eating the daily recommended servings of vegetables and fruits, respectively. Conversely, the item with the highest location estimate is: “During the past year has your drug or alcohol use led to any problems with the legal system such as drunk and disorderly arrests, being picked up for drug possession, etc.?” The estimate for this item is 3.322, indicating that subjects who had experienced legal problems as a result of their substance use had significantly different substance use patterns from those who did not endorse this item.
IRT methods have important applications in health outcome measurement. For the most part, statisticians still use traditional methods, including factor analysis, principal component analysis and discriminant analysis, together with Cronbach’s alpha, to build test questionnaires, identify highly discriminating items and evaluate the internal validity of test domains. In this paper, we have illustrated a method which has already found widespread application among researchers in education, behavioral health and psychometrics as an alternative to commonly used multivariate methods. With the availability of a new procedure in SAS (version 9.4) to conduct IRT analysis, along with multiple open source software packages, statisticians involved in health outcome measurement research can benefit from the use of IRT methods.