Item response theory (IRT) comprises a collection of statistical models that define the relationship between a continuous latent variable and the characteristics of items, in order to estimate the probability of endorsing an item at a given level of the scale. IRT therefore requires precise measurement of the items and is used to model categorical data, whether dichotomous or polytomous, nominal or ordinal, through linear and nonlinear latent trait models.
Item response theory models are regarded as members of the factor analysis family and are widely applied in computerized adaptive testing, as they focus not only on the scale as a whole but also on the individual items. Thus, IRT allows scales to be elaborated, revised and optimized for specific uses, treating patterns of responses in probabilistic terms. IRT is also often used to measure skills or abilities, that is, features that cannot be observed directly.
To apply IRT, it is necessary to meet three epistemological assumptions. The first is the dimensionality of the latent space, which is usually one-dimensional, i.e. a single latent trait is sufficient to explain the common variance between the answers to the items; it is assessed by determining how well IRT models of varying dimensionality fit the data. The second is stochastic local independence, under which the responses to the items are considered independent once the level of θ is fixed. The third, monotonicity, implies that performance on an item is monotonically related to the ability of the individual. Performance test items inherently satisfy this assumption; therefore, monotonicity is implicitly assumed.
Item response theory is a breakthrough in measurement processes, especially as it enables the dimensions involved in the evaluation of specific characteristics to be expressed on a single latent trait scale. IRT therefore allows the functional capacity index (FCI), the latent trait proposed in this study, to be measured in different elderly persons, or in the same individuals under various health conditions, and their functional abilities to be compared through common items. Thus, elderly persons who successfully perform the same number of activities of daily living, but not the same activities, can be estimated to have different functional capacities. This is the most innovative feature of the approach, as it is based on evaluating the individual activities and not only the test as a whole.
The aim of the present study was to apply IRT to measure the FCI as an important measure of the positive aspects of the functionality component among elderly people, contributing to wider use of this tool by creating shorter and more precise scales for the analysis of health situations.
2.1. Study Design
An observational, cross-sectional study based on individual-level data was performed. The study sample consisted of 41,269 elderly persons aged 60 years or more from the 27 federal units, who participated in the National Household Sample Survey (PNAD 2008 Population) conducted in Brazil. The weights and strata of the PNAD sampling plan were incorporated into the IRT modelling process, to avoid biased estimates of population parameters and to ensure the representativeness of the sample. Because the data are in the public domain and do not identify survey participants, this study was exempted from submission to the Ethics Committee, in accordance with the recommendations of the National Research Ethics Commission.
2.2. Measured Variables
The latent trait, FCI, consists of seven ordinal, polytomous items with four response categories, which are treated as observed dependent variables (Table 1) and are based on theoretical models defined in other studies. The order of the items and the possible termination of the questionnaire after the first question assume that if the individual had difficulty or could not feed him or herself, bathe or go to the bathroom, he or she could not perform the other tasks. On the other hand, the item “walk about 100 meters” would only be answered if the individual reported some difficulty in the item “walk more than one kilometre”, representing a major activity limitation.
Although the latent variable theoretically ranges from −∞ to +∞, for practical purposes θ was considered to vary from −5 to +5, where negative values near −5 indicated low levels of the trait, while positive values closer to +5 represented the best functional capacity among the elderly.
2.3. Analytical Approach
Calibration IRT: The generalized partial credit model (GPCM) was used for the calibration of items (Annex). For each item, this model estimates a discrimination parameter (αi) and three step difficulty parameters, one for each transition between adjacent response categories.
Table 1. Items related to the functional capacity index, Brazil, 2008.
Note: The response categories are 0―cannot; 1―great difficulty; 2―little difficulty; and 3―no difficulty. Source: National Household Sample Survey, 2008.
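The category probabilities implied by the GPCM can be illustrated with a minimal sketch. It assumes the common slope-times-distance parameterization, in which each step contributes α(θ − b) to the cumulative logit; the function name `gpcm_probs` and the item parameters below are hypothetical and are not the estimates reported in Table 3.

```python
import math


def gpcm_probs(theta, alpha, steps):
    """Return P(X = k | theta) for k = 0..len(steps) under the GPCM.

    alpha is the item discrimination and steps holds the step difficulty
    parameters (three of them for a four-category item).
    """
    # Cumulative logit is 0 for category 0, then a running sum of
    # alpha * (theta - b) over the steps passed to reach category k.
    logits = [0.0]
    for b in steps:
        logits.append(logits[-1] + alpha * (theta - b))
    # Subtract the largest logit before exponentiating to avoid overflow.
    top = max(logits)
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical item (illustrative values only, not taken from Table 3).
probs = gpcm_probs(theta=0.0, alpha=2.0, steps=[-1.5, -1.0, -0.5])
```

Evaluating `gpcm_probs` over a grid of θ values gives one curve per response category, which is how category characteristic curves are usually drawn.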
The sample was composed of 56% women and 44% men, and 72% reported having experienced some kind of difficulty in performing one or more of the tasks described in Table 1. The average age of the overall sample was 69.9 years (CI95: 69.8 - 70.0), with 69.5 years (CI95: 69.4 - 69.7) for men and 70.3 years (CI95: 70.1 - 70.4) for women. The relative frequencies of the response categories of the items comprising the FCI are described in Table 2.
The estimated parameters and the standard errors of GPCM calibration are listed in Table 3.
Table 2. Relative frequencies of response categories for the seven items of the functional capacity index (n = 41,269).
Note: Projection of the elderly (60+ years old): N = 21,030,606. Source: National Household Sample Survey, 2008.
Table 3. Estimates of the parameters of the generalized partial credit model for the functional capacity index (n = 41,269).
αi, discrimination parameter; βi, “local” parameter; γi, threshold parameter; SE, standard error. Note: Projection of the elderly (60+ years old): N = 21,030,606. Source: National Household Sample Survey, 2008.
All items showed good discrimination, considering a cut-off value of one. The highest discrimination was 4.96 for the WALK100M item, while the lowest was 2.58 for the RUN item; items with higher discrimination provide more information about the proposed construct. The threshold parameters indicate the location (“local”) at which each response category of an item falls on the latent scale, with an overall range between −2.36 (BATH) and 0.22 (RUN). Therefore, the item response categories were endorsed only by the elderly people who had the lowest levels of functional capacity, implying that the set of items is more useful in discriminating between individuals at the lower end of the latent continuum. Thus, it is likely that the construct cannot measure very high levels of functional capacity among the elderly.
Considering the local values, which are the average of the threshold parameters for each item, the item indicating the lowest functional capacity was BATH (easiest to perform) and the item indicating the greatest functional capacity was RUN (most difficult to perform). Thus, an elderly person who could not feed him or herself, bathe or go to the bathroom clearly presented a highly unfavourable FCI, as these are very basic activities of daily living.
The contents of Table 3 can be interpreted more easily by observing the category characteristic curves (CCC) for each of the items, shown in Figure 1. These CCCs vary between items according to the discrimination parameter, and between the categories of the same item according to the difficulty of each step parameter. The CCCs model the relationship between the probability of an elderly person endorsing a response category and the level θ of the construct measured by the scale.
The shapes of the CCCs differ across the response categories: category 0 is monotonically decreasing, category 3 is monotonically increasing, and categories 1 and 2 are unimodal. In Figure 1, the discrimination is relatively low for the RUN item, while it is more pronounced for the CLIMBSTAIRS item. In addition, the CLIMBSTAIRS item is located at relatively lower levels of functional capacity than the RUN item.
The item information functions (IIF) of the construct are shown in Figure 2(a) and show how variations in difficulty affect measurement precision along the latent continuum. Each item measures the latent trait most reliably around its estimated difficulty parameters.
The WALK100M item has the highest slope, and therefore provides the maximum information among the seven items of the construct. However, if the level of the variable is low, the WALK100M item provides little information and, in return, the CLIMBSTAIRS item and, at the higher end of θ, the RUN item are more informative.
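The link between slope and information can be made explicit. A standard result for the GPCM with category scores 0 to m_i is that the item information equals the squared discrimination multiplied by the conditional variance of the item score; the test information is the sum over the seven items. The notation P_ik(θ) (probability of category k of item i) and m_i (highest category, here 3) is introduced for this illustration and does not appear in the original text.

```latex
% Item information for item i under the GPCM, categories k = 0, ..., m_i
I_i(\theta) = \alpha_i^{2}
  \left[ \sum_{k=0}^{m_i} k^{2} P_{ik}(\theta)
       - \left( \sum_{k=0}^{m_i} k \, P_{ik}(\theta) \right)^{2} \right],
\qquad
I(\theta) = \sum_{i=1}^{7} I_i(\theta)
```

Because α_i enters as a square, an item with a steep slope contributes a tall but narrow peak of information, concentrated where its response categories cross.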
Figure 1. Category characteristic curves for the polytomous items of the functional capacity index.
The test information function (TIF) is obtained by summing all the IIFs. Thus, the more items that are added to the test, the greater the amount of information. More information means that a parameter is estimated with more precision, so that more is known about its value than if it had been estimated less precisely. Figure 2(b) shows the TIF and the standard errors. The TIF reaches its maximum within a limited region of the latent scale: only the θ values between −1.86 and 0.44 are estimated with an acceptable standard error, i.e. below 0.3. The IRT standard error is greater at the extremes of the scale, which is theoretically unbounded. The standard error gives the precision with which θ is estimated. For Bayesian scores, reliability can be given by: 1 − (error variance)/(observed variance + error variance), with the mean square error used for the error variance. In this study, the average reliability was 0.8.
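The relationship between information, standard error and reliability can be sketched numerically. The snippet below assumes the usual asymptotic relation SE(θ) = 1/√I(θ) and uses the reliability expression exactly as written above; the function names and the variance values are hypothetical and chosen only for illustration.

```python
import math


def standard_error(information):
    """Standard error of theta implied by a test information value,
    assuming the asymptotic relation SE = 1 / sqrt(information)."""
    return 1.0 / math.sqrt(information)


def reliability(error_variance, observed_variance):
    """Reliability as stated in the text:
    1 - error variance / (observed variance + error variance)."""
    return 1.0 - error_variance / (observed_variance + error_variance)


# Information needed for the standard error to reach the 0.3 cut-off.
required_information = 1.0 / 0.3 ** 2

# Hypothetical variances (not taken from the study data).
example_reliability = reliability(error_variance=0.2, observed_variance=0.8)
```

Under this relation, a standard error below 0.3 corresponds to a test information value above roughly 11.1, which is why acceptable precision is confined to the region where the TIF is high.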
Figure 2. Graphical representation of the functional capacity index according to the generalized partial credit model: (a) item information functions; (b) test information function; and (c) test characteristic curve.
The characteristic curves of the items can be summed to give the test characteristic curve (TCC) (Figure 2(c)), which shows the correspondence between the FCI and the total scores. The results indicate that an elderly person with θ = +1.96 has an expected summed score of 21, in other words complete functional capacity. The discriminatory power of the scale is concentrated where the curve of summed scores rises sharply. To transform θ scores to the summed score scale, the θ scores can be multiplied by the standard deviation of the summed scores and then added to their mean.
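As a minimal sketch of how a test characteristic curve arises, the expected score of each item can be computed from its GPCM category probabilities and summed over the items; the linear rescaling described above is included as a separate helper. The function names, the seven item parameter sets, and the mean and standard deviation used here are hypothetical and are not the values estimated in this study.

```python
import math


def expected_item_score(theta, alpha, steps):
    """Expected score (0..len(steps)) of one GPCM item at a given theta."""
    logits = [0.0]
    for b in steps:
        logits.append(logits[-1] + alpha * (theta - b))
    top = max(logits)  # stabilise the exponentials
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return sum(k * w / total for k, w in enumerate(weights))


def tcc(theta, items):
    """Test characteristic curve: sum of the expected item scores."""
    return sum(expected_item_score(theta, a, steps) for a, steps in items)


def theta_to_summed_scale(theta, summed_mean, summed_sd):
    """Linear rescaling: multiply theta by the SD of the summed scores
    and add their mean."""
    return summed_mean + summed_sd * theta


# Seven hypothetical four-category items (illustrative parameters only).
items = [
    (2.5 + 0.3 * i, [-2.0 + 0.2 * i, -1.5 + 0.2 * i, -1.0 + 0.2 * i])
    for i in range(7)
]
curve = [tcc(t / 10.0, items) for t in range(-60, 61)]
```

With seven items scored 0 to 3, the curve runs from 0 to a maximum of 21, and its steepest section marks the region of θ where the summed score discriminates best.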
The main aim of this study was to present the use of IRT as a tool for creating shorter and more precise scales for measuring limitations in activities of daily living among the elderly, considering that some items behave differently from others. Furthermore, the need to measure “unobservable” characteristics makes IRT a promising approach for the clinical and epidemiological aspects of health. Increased experience among health service researchers should lead to better implementation of this method in the health field, with improved collective application of the methodology. However, it is necessary to reconsider and strengthen educational investment in the methodology and statistics of scale development, with an active role for both methodological researchers and experts, considering the increasing availability of statistical packages with specific IRT analysis features.
The results of the present study showed that it is possible to assess functional capacity with a robust latent trait-based scale and not only with a systematically combined score. In this respect, the primary advantage of IRT compared to other techniques is the ability to assign different weights to groups of related items and to classify elderly persons on the latent scale of functional capacity, considering both the difficulty of the activity related to each item and the inherent ability of individuals to perform such tasks. Thus, the estimated levels of the FCI take into account the different contribution of each item to the latent trait scale.
Regarded as a measure of the positive aspects of the functionality component, the FCI offers a more understandable summary of a complex reality and can be used in health situation analysis for the purpose of developing public policy relating to the elderly. However, it is important to consider that the FCI is composed of indicators that combine various activities with different difficulties (feeding and bathing, for example), which compromises not only the theoretical characterization of dimensionality but also the parameters related to IRT. In this respect, it is essential that surveys organize the items with separate descriptions of each task, allowing their respective dimensions to be assessed at a later stage. These possible dimensions could be analysed with multidimensional models to form a more robust index that reflects various characteristics of functionality among the elderly.
In addition, when the underlying assumptions were tested, the IRT model fit the observed data, and the evaluation of the performance of each item, with respect to the response options and the item stem, can help in the selection and writing of items to optimize future measures of physical functioning. The unidimensional model of functional capacity can be extended by inserting new items that measure the same feature of functionality, or new dimensions can be added. Support for the unidimensionality of the FCI items is consistent with previous studies.
The IRT approach has a significant advantage over conventional approaches, as the results obtained using different instruments are on the same scale. Thus, the IRT framework allows different scales using various items related to activities of daily living to be compared without the need to change the scale. In contrast, when different instruments are compared using classical test theory, transformation of scales is frequently necessary to solve invariance problems, but at the cost of major bias, as the instruments are often not comparable and doubt may be cast on any comparison. It should be borne in mind that the origin of the scale is arbitrary for each data group, and that it is necessary to conduct score-matching (equating) tests to demonstrate comparability between groups or between measures over time.
Another important point with IRT is that there is no assumption that scores are normally distributed. Thus, θ scores do not necessarily translate to percentile scores of a normal distribution and, therefore, caution is required when setting cut-off points in measurements of functional capacity based on this process. Moreover, it is possible that IRT scales do not meet the normality assumptions of linear multilevel regression models, in which case they should be transformed into scales with ordered categories and analysed with specific and sufficiently robust models.
In the present study, greater reliability of the FCI could be achieved with the inclusion of more difficult items for individuals with greater functional capacity. Previous studies using IRT to generate physical functionality scales achieved satisfactory coverage of higher levels of physical functioning  .
In closing, the present study argues that the FCI can be used as a tool in the analysis and prioritization of health situations. While this is a brief approach to IRT and the model is simple enough, we believe in its potential within the context of health measurement processes and encourage readers to explore this (relatively) new methodology.
Table A1. Explanation of the steps used to estimate the generalized partial credit model of item response theory for the functional capacity index among elderly persons aged 60 years or more, considering the weights and strata of the sampling plan of the National Household Sample Survey (PNAD, 2008), in Brazil, through specific commands of the statistical package STATA 14.1.
*Note: All parameters and related graphics in this study, as well as the differential item functioning analysis, are available in the IRT module of STATA 14.1.