PSYCH Vol.9 No.11, October 2018
Applied Psychometrics: The Steps of Scale Development and Standardization Process
This work presents the development process of a self-report measurement instrument. Numerous scale development procedures are reviewed and summarized into an overall framework of consecutive steps, each with a concise description. The issues covered are the following. First, the theoretical underpinning of the scale construct is described, along with the response specifications and the response formats available (popular ones such as the Likert format and some more elaborate alternatives). Item writing guidelines follow, together with strategies for discarding poor items when finalizing the item pool. The item selection criteria described comprise an expert panel review, pretesting, and item analysis. Finally, dimensionality evaluation is summarized, along with test scoring and standardization (norming). Scale construction has implications for research conclusions, affecting reliability and the statistical significance of the effects obtained, or, stated differently, the accuracy and sensitivity of the instruments.

1. Introduction and Basic Concepts

A questionnaire (also called a test or a scale) is defined as a set of items designed to measure one or more underlying constructs, also called latent variables (Fabrigar & Ebel-Lam, 2007). In other words, it is a set of objective and standardized self-report questions whose responses are summed to yield a score. An item score is defined as the number assigned to performance on the item, task, or stimulus (Dorans, 2018: p. 578). The definition of a questionnaire or test is rather broad and encompasses everything from a single scale measuring life satisfaction (e.g., the SWLS; Diener et al., 1985) to complete test batteries such as the Woodcock-Johnson IV battery by Schrank, Mather, and McGrew (2014), comprising cognitive tests (Irwing & Hughes, 2018). The scale items are indicators of the measured construct, and hence the score is also an indicator of the construct (Zumbo et al., 2002; Singh et al., 2016). Generally, there are attitude, trait, and ability scales (Irwing & Hughes, 2018). Attitude, ability and intellectual reasoning measures, and personality measures are considered technical tools, equivalent, e.g., to a pressure gauge or a voltmeter (Coolican, 2014). Over the past decades, such instruments became popular in psychology mainly because they provide multiple related pieces of information on the latent construct being assessed (Raykov, 2012). Scale development, or construction, is the act of assembling and/or writing the most appropriate items that constitute test questions (Chadha, 2009) for a target population. The target population is defined as the group for whom the test is developed (Dorans, 2018). Test development and standardization (or norming) are two related processes, where test development comes first and standardization follows.
During test development, after item assembly and analysis, the items that are the strongest indicators of the measured latent construct are selected and the final pool emerges, whereas in standardization, standard norms are specified (Chadha, 2009). Effective scale construction has important implications for research inferences, affecting first the quality and the size of the effects obtained and second the statistical significance of those effects (Furr, 2011), or, in other words, the accuracy and sensitivity of the instruments (Price, 2017). A set of standards for assessing standardized tests for psychology and education has been published jointly by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (AERA/APA/NCME, 1999, 2014; Streiner, Norman, & Cairney, 2015). Generally, successful tests are developed due to some combination of the three following conditions (Irwing & Hughes, 2018): 1) theoretical advances (e.g., NEO PI-R by Costa & McCrae, 1995); 2) empirical advances (e.g., MMPI by Butcher, Dahlstrom, Graham, Tellegen, & Kaemmer, 1989); 3) a practical or market need (e.g., SAT by Coyle & Pillow, 2008).

The purpose of this work is to provide a review of the scale development and standardization process.

2. The Scale Development Process Overview

The scale development process as described by Trochim (2006) is completed in five steps (as quoted by Dimitrov, 2012): 1) Define the measured trait, assuming it is unidimensional. 2) Generate a pool of potential Likert items (preferably 80-100) rated on a 5- or 7-point disagree-agree response scale. 3) Have the items rated by a panel of experts on a 1-5 scale on how favorably the items measure the construct (from 1 = strongly unfavorable to 5 = strongly favorable). 4) Select the items to retain for the final scale. 5) Administer the scale and sum the responses to all items (the raw score of the scale), reversing items that measure something in the opposite direction of the rest of the scale. Because the overall assessment with an instrument is based on the respondent’s scores on all items, the measurement quality of the total score is of particular interest (Dimitrov, 2012). In a similar vein, Furr (2011) also described scale development as a process completed in five steps: (a) define the construct measured and the context, (b) choose the response format, (c) assemble the initial item pool, (d) select and revise items, and (e) evaluate the psychometric properties (see relevant section). Steps (d) and (e) are an iterative process of refinement of the initial pool until the properties of the scale are adequate. The test score can then be standardized (see relevant section).
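The sum-and-reverse logic of step 5 can be sketched as follows. This is an illustrative example, not from the paper; the item names, values, and function name are made up:

```python
# Illustrative sketch: summing Likert responses, flipping reverse-keyed
# items first, on a 1-5 disagree-agree response scale.

def raw_score(responses, reverse_keyed, scale_min=1, scale_max=5):
    """Return the raw scale score (sum of item responses).

    responses: dict mapping item id -> response in [scale_min, scale_max]
    reverse_keyed: set of item ids worded in the opposite direction
    """
    total = 0
    for item, value in responses.items():
        if item in reverse_keyed:
            # A response of 5 on a reversed item counts as 1, and so on.
            value = scale_max + scale_min - value
        total += value
    return total

answers = {"q1": 4, "q2": 2, "q3": 5}        # q3 is worded negatively
print(raw_score(answers, reverse_keyed={"q3"}))  # 4 + 2 + (6 - 5) → 7
```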

There are several models of test development. In practice, steps within the different stages may actually be grouped and undertaken in different combinations and sequences, and crucially, many steps of the process are iterative (Irwing & Hughes, 2018). In Table 1 the scale development process as described by multiple sources is presented, as the steps suggested by the different sources differ. Note that an integrative approach combining the steps from all sources is contained at the bottom of Table 1. The phases of the scale development process are presented in the sections below.

3. Phase A: Instrument Purpose and Construct Measured

When instruments are developed effectively, they show adequate reliability and validity, supporting the use of the resulting scores. To reach this goal, a systematic development approach is required (Price, 2017). However, the development of scales to assess subjective attributes is considered rather difficult and requires both mental and financial resources (Streiner et al., 2015). The prerequisite is to be aware of all existing scales that could suit the purpose of the measurement instrument you wish to develop, judging their usefulness without any tendency to maximize their deficiencies, before embarking on any test construction adventure. Then there is one more consideration: feasibility. Some feasibility dimensions that need to be considered are time, cost, scoring, the method of administration, intrusiveness, the consequences of false-positive and false-negative decisions, and so forth (Streiner et al., 2015). After that, the scale development process can start with the definition of the purpose of the instrument within a specific domain, the instrument score, and the constraints inherent in the development (Dimitrov, 2012; Price, 2017). As a rule, in the research field of psychology, the general purpose of a scale is to discriminate individuals with high levels of the construct being measured from those with lower levels (Furr, 2011).

However, the test developer should first determine clearly the intended construct being measured. Defining the construct to be measured is a crucial step requiring clarity and specificity (DeVellis, 2017; Price, 2017). Note that item analysis can also be carried out within the SEM context; however, this approach is beyond the scope of this work (refer to Raykov, 2012, for details).

Response Bias

An additional consideration when selecting items is whether items cause response sets that either bias responses or generate response artifacts. Generally, this is mainly attributed to the sequence of items. The most common response sets are: yea-saying (acquiescence bias: respondents agree with the statements), nay-saying (respondents reject the statements), consistency and availability artifacts, halo effects (Thorndike, 1920; Campbell & Fiske, 1959: p. 84), and social desirability artifacts, i.e., respondents trying to present themselves in a favorable light. Likert scales may also present a central tendency bias, whereby respondents avoid selecting the extreme scale categories (Irwing & Hughes, 2018; Dimitrov, 2012).

Figure 4. Overview of the pilot testing procedure and item analysis procedure. Content is based on Streiner et al. (2015: p. 94).

7. Phase E: Testing the Psychometric Properties of the Scale

In the final phase of the test development process, a validation study is always carried out in a large and representative development sample (DeVellis, 2017) to estimate further the psychometric properties of the scale (Dimitrov, 2012). That is, after an initial pool of items has been developed and pilot tested (pretested) in a representative sample, the performance of the individual items is evaluated to select the most appropriate ones for the final scale and to examine scale dimensionality (DeVellis, 2017). The statistical techniques used for these purposes are item analysis (as during pretesting) and factor analysis (Price, 2017). The item-analysis criteria for item selection in this phase are the same as in pretesting (Singh et al., 2016). The dimensionality of a scale is examined with exploratory factor analysis and confirmatory factor analysis (Furr, 2011; Singh et al., 2016). Usually, scales are administered, analyzed, revised, and readministered a number of times before their psychometric properties are acceptable (Irwing & Hughes, 2018; Furr, 2011).
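One widely used item-analysis statistic is the corrected item-total correlation: each item is correlated with the sum of the remaining items. A minimal sketch follows, assuming a respondents-by-items data layout; the function names are illustrative, and the .30 cut-off mentioned in the note is a common rule of thumb, not a criterion stated in this paper:

```python
# Sketch of corrected item-total correlations for item analysis.
# Each item is correlated with the "rest-score" (total minus that item).

from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

def corrected_item_total(data):
    """data: list of respondents, each a list of item scores.
    Returns one correlation per item, item vs. rest-score."""
    n_items = len(data[0])
    results = []
    for j in range(n_items):
        item = [row[j] for row in data]
        rest = [sum(row) - row[j] for row in data]
        results.append(pearson(item, rest))
    return results
```

Items with low corrected item-total correlations (a commonly used threshold is below about .30) are candidates for removal, since they contribute little to the total score.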

7.1. Dimensionality

A scale’s dimensionality, or factor structure, refers to the number and nature of the variables reflected in its items (Furr, 2011). A scale measuring a single construct (e.g., a property or ability) is called unidimensional. This means that a single latent variable (factor) underlies the scale items. In contrast, a scale measuring two or more constructs (latent variables) is multidimensional (Dimitrov, 2012).

Developers examine several issues regarding a scale’s dimensionality in this phase of the scale development process. First, they seek to define the number of dimensions underlying the construct. These are called latent variables (factors) and are measured by the scale items. A scale is unidimensional when all items tap a single construct (e.g., self-esteem). On the other hand, a scale is multidimensional when the scale items tap two or more latent variables, as in personality tests (Dimitrov, 2012). If a scale is multidimensional, the developer also examines whether the dimensions are correlated with each other. Finally, in a multidimensional scale, the latent variables must be interpreted according to the theoretical background to see what dimensions they tap, identifying the nature of the construct the dimensions reflect (Furr, 2011) and demonstrating construct validity (Streiner et al., 2015), and the reliability of each dimension must be calculated. Factor analysis provides the answers to these dimensionality questions (see Figure 5).

7.2. Factor Analysis

“Factor analysis is a statistical technique that provides a rigorous approach for confirming whether the set of test items comprising a test function in a way that is congruent with the underlying theory of the test” (Price, 2017: p. 180), based on classical measurement theory, also termed Classical Test Theory (DeVellis, 2017). Factor analysis is an integral part of scale development. It permits data to be analyzed to determine the number of underlying factors beneath a group of items, so that analytic procedures for the psychometric properties, like Cronbach’s alpha (Cronbach, 1951) and correlations with other constructs, can be performed properly. Eventually, through factor identification, insight into the nature of the latent variable underlying the scale items is gained (DeVellis, 2017). A factor is defined as an unobserved or latent variable representative of a construct (Price, 2017: p. 236).

Figure 5. The process of dimensionality evaluation of the scale under development and issues related to it. Source: adapted from Furr, 2011: p. 26.
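Cronbach’s alpha (Cronbach, 1951), mentioned above, has a compact closed form: alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), for k items. A minimal sketch, assuming a respondents-by-items data layout:

```python
# Sketch of Cronbach's alpha from raw data: rows are respondents,
# columns are items. Uses population variances throughout.

from statistics import pvariance

def cronbach_alpha(data):
    k = len(data[0])  # number of items
    item_vars = [pvariance([row[j] for row in data]) for j in range(k)]
    total_var = pvariance([sum(row) for row in data])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

For perfectly parallel items (every respondent gives the same answer to all items) alpha reaches 1.0; uncorrelated items drive it toward 0.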

A detailed description of these techniques is beyond the scope of this work, but refer to Kyriazos (2018a, 2018b) for a complete description of the construct validation process. For scale validation studies, refer to Howard et al. (2016), El Akremi, Gond, Swaen, De Roeck, and Igalens (2015), and Konrath, Meier, and Bushman (2017). Pavot (2018) also suggests reviewing Lyubomirsky and Lepper (1999), Seligson, Huebner, and Valois (2003), and Diener et al. (2010).

7.3. Item Response Theory (IRT)

There is also an alternative to the classical test theory model, called item response theory (IRT). IRT is often presented as a superior alternative to CTT (see De Boeck & Wilson, 2004; Embretson & Reise, 2010; Nering & Ostini, 2010; Reise & Revicki, 2015, quoted by DeVellis, 2017). IRT is a model-based measurement approach using item response patterns and a person’s abilities. In IRT, a person’s responses to each scale item are explainable based on his or her ability level. The respondent’s ability is represented by a monotonically increasing function, based on response patterns (Price, 2017).

According to IRT, several factors affect a person’s responses. Along with the person’s perceived level of the construct being measured by each scale item, other item properties potentially affecting responses are: (a) item difficulty, (b) item discrimination, and (c) guessing. In most IRT applications in the context of psychology, researchers estimate psychometric properties both at the item level and at the scale level. IRT includes many specific measurement models as a function of the different factors potentially affecting individual responses. However, all IRT models are framed in terms of the probability that a respondent will respond in a specific manner to an item, as a result of a specific level of the underlying behavior. The simplest IRT measurement models comprise only item difficulty, while more complex models comprise two or more item parameters, such as item discrimination and guessing. There are different models for dichotomous items and others for polytomous items (Furr, 2011). IRT models also vary according to the number of item response options.
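The three item properties just listed map onto the parameters of the standard 3-parameter logistic (3PL) IRT model for dichotomous items: difficulty b, discrimination a, and a guessing floor c. The sketch below uses made-up parameter values for illustration:

```python
# 3PL item response function: probability that a person at ability
# level theta answers a dichotomous item correctly / endorses it.

import math

def p_correct(theta, a=1.0, b=0.0, c=0.0):
    """With a=1 and c=0 this reduces to the simplest difficulty-only
    (Rasch-type) model; adding a gives the 2PL, adding c the 3PL."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# The function increases monotonically in theta, as described above:
low = p_correct(-2, a=1.2, b=0.5, c=0.2)
high = p_correct(2, a=1.2, b=0.5, c=0.2)
print(low < high)  # → True
```

Note how c sets a lower asymptote: even a respondent with very low ability has a nonzero chance of a correct answer by guessing.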

The effectiveness of a technique is a function of the theoretical framework of the target construct. IRT scoring is used in tests of cognitive ability; however, in other situations this type of scoring may not be desirable (Irwing & Hughes, 2018). A combination of CTT and IRT has been suggested as an alternative option (Embretson & Hershberger, 1999; DeVellis, 2017; Irwing & Hughes, 2018). In most cases, common practice in test development involves a combination of either confirmatory factor analysis (CFA) and IRT (Irwing & Hughes, 2018) or, more commonly, EFA and CFA (Steger et al., 2006; Fabrigar & Wegener, 2012; Kyriazos, 2018a).

7.4. Test Scoring and Standardization (Norming)

Raw scale scores can be based either on a unit-weighted sum of item scores or on factor scores. Unit-weighted scoring schemas generate standardized scores using an appropriate standardization sample, or normative sample (Dimitrov, 2012); examples are stanine, sten, and T scores (Smith & Smith, 2005). Unit-weighted sums of item scores without standardization may be considered in some research frameworks. Box-Cox procedures (Box & Cox, 1964) can be used to estimate the power to which the scale score should be raised to follow normality. Subsequently, the scale score is raised to the previously estimated power and standardized. Standardization (or norming) is carried out by subtracting the mean transformed score from the transformed scale scores and dividing by the standard deviation of the transformed scores (Irwing & Hughes, 2018). A standardized score denotes the relative position of each respondent in the target population (Dimitrov, 2012).
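The transform-then-standardize procedure just described can be sketched as follows. This is an illustrative example: the power lam is taken as given (Box-Cox procedures would estimate it so that the transformed scores approximate normality; that estimation step is omitted here), and the sample scores are made up:

```python
# Raise raw scale scores to a power, then z-standardize: subtract the
# mean of the transformed scores and divide by their standard deviation.

from statistics import mean, pstdev

def standardize(scores, lam=1.0):
    transformed = [s ** lam for s in scores]
    m, sd = mean(transformed), pstdev(transformed)
    return [(t - m) / sd for t in transformed]

z = standardize([10, 12, 15, 20, 28], lam=0.5)
print([round(v, 2) for v in z])
```

By construction the standardized scores have mean 0 and standard deviation 1, so each value directly expresses a respondent's position relative to the sample.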

Streiner et al. (2015) note the following: (A) Variable weighting of scale items is effective only under certain conditions. (B) If a test is constructed for local/limited use only, the sum of the items is probably sufficient; to enable comparison of the results with other instruments, it is suggested that scores be transformed into percentiles, z-scores, or T-scores. (C) For the measurement of attributes that are not the same in males and females, or for attributes that show developmental changes, separate age and/or age-sex norms can be considered (Streiner et al., 2015).
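For point (B), the conversions are simple: a percentile rank is the share of the normative sample at or below a raw score, and a T-score rescales a z-score as T = 50 + 10z. A minimal sketch with a made-up normative sample:

```python
# Re-express raw sums as percentile ranks and T-scores against a
# normative sample, to make results comparable across instruments.

from statistics import mean, pstdev

def percentile_rank(score, norm_sample):
    """Percent of the normative sample scoring at or below `score`."""
    return 100 * sum(s <= score for s in norm_sample) / len(norm_sample)

def t_score(score, norm_sample):
    z = (score - mean(norm_sample)) / pstdev(norm_sample)
    return 50 + 10 * z

norms = [12, 14, 15, 15, 16, 18, 20, 21, 23, 26]
print(percentile_rank(18, norms))      # → 60.0
print(round(t_score(18, norms), 1))    # → 50.0 (18 is the sample mean)
```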

8. Summary & Conclusions

Experts suggest that effective measurement is the cornerstone of scientific research (DeVellis, 2017; Netemeyer, Bearden, & Sharma, 2003) and an integral part of the latent variable model (Slavec & Novsek, 2012). Generally, there are attitude, trait, and ability measures. The purpose of scaling is to construct a scale with specific measurement characteristics for the construct measured. The most commonly employed response formats in psychology are Likert-type, multiple-choice, or forced-choice items. Scaling is generally divided into the types established by Thurstone (1927, 1928), Likert (1932, 1952), or Guttman (1941, 1944, 1946). In Likert scaling, the response levels are anchored with consecutive integer values, each corresponding to verbal labels indicating approximately evenly spaced intervals; it is the most popular scaling method in psychological measures (Dimitrov, 2012; Furr, 2011; Barker et al., 2016). To a degree, the scaling type and the response format have an impact on item writing and on the scale development process as a whole (Irwing & Hughes, 2018). An item pool should be as rich as possible for the developing scale; it should contain numerous items pertinent to the target construct (DeVellis, 2017). The steps of an instrument development process involve the following: 1) definition of the instrument purpose, domain, and construct; 2) definition of the response scale format; 3) item generation to construct an item pool 2-4 times larger than the desired length of the final scale version; 4) item selection based on expert panel reviews and/or pretesting to maximize instrument reliability with item analysis; 5) large-scale validation study(s) to establish construct validity with supplementary item analysis and factor analysis, and to standardize the scale scores.

Construct validation studies to evaluate scale dimensionality, together with norming, are a necessary step in scale development after the pool is examined by experts and/or pretesting. The reliability of measurements signifies the degree to which a score shows accuracy, consistency, and replicability. Construct validity is mainly evidenced by the correlational and measurement consistency of the target construct and its items (indicators), mainly by carrying out a factor analysis (Dimitrov, 2012). Scales that are developed thoughtfully and precisely have a greater potential of growing into questionnaires that measure real-world criteria more accurately (Saville & MacIver, 2017).

Cite this paper
Kyriazos, T. and Stalikas, A. (2018) Applied Psychometrics: The Steps of Scale Development and Standardization Process. Psychology, 9, 2531-2560. doi: 10.4236/psych.2018.911145.

[1]   Ackerman, T. A. (1992). A Didactic Explanation of Item Bias, Item Impact, and Item Validity from a Multidimensional Perspective. Journal of Educational Measurement, 29, 67-91.

[2]   Aiken, L. R. (2002). Attitudes and Related Psychosocial Constructs: Theories, Assessment and Research. Thousand Oaks, CA: Sage.

[3]   Aitken, R. C. B. (1969). A Growing Edge of Measurement of Feelings. Proceedings of the Royal Society of Medicine, 62, 989-92.

[4]   Ajzen, I. (1991). The Theory of Planned Behavior. Organizational Behavior and Human Decision Processes, 50, 179-211.

[5]   Allen, M. J., & Yen, W. M. (1979). Introduction to Measurement Theory. Monterey, CA: Brooks/Cole.

[6]   American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (AERA/APA/NCME) (1999). Standards for Educational and Psychological Testing (2nd ed.). Washington DC: Authors.

[7]   American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (AERA/APA/NCME) (2014). Standards for Educational and Psychological Testing (3rd ed.). Washington DC: Authors.

[8]   Anastasi, A. (1982). Psychological Testing (5th ed.). New York: Macmillan.

[9]   Anastasi, A., & Urbina, S. (1996). Psychological Testing (7th ed.). New York, ΝY: Pearson.

[10]   Bandura, A. (1997). Self-Efficacy: The Exercise of Control. New York, NY: Freeman.

[11]   Barker, C., Pistrang, N., & Elliott, R. (2016). Research Methods in Clinical Psychology: An Introduction for Students and Practitioners (3rd ed.). Oxford, UK: John Wiley & Sons, Ltd.

[12]   Bishop, G. F. (1990). Issue Involvement and Response Effects in Public Opinion Surveys. Public Opinion Quarterly, 54, 209-218.

[13]   Box, G. E. P., & Cox, D. R. (1964). An Analysis of Transformations. Journal of the Royal Statistical Society. Series B (Methodological), 26, 211-254.

[14]   Bradburn, N. M., Sudman, S., & Wansink, B. (2004). Asking Questions: The Definitive Guide to Questionnaire Design—For Market Research, Political Polls, and Social and Health Questionnaires. San Francisco, CA: Jossey-Bass.

[15]   Butcher, J. N., Dahlstrom, W. G., Graham, J. R., Tellegen, A., & Kaemmer, B. (1989). Minnesota Multiphasic Personality Inventory-2 (MMPI-2). Manual for Administration and Scoring. Minneapolis: University of Minnesota Press.

[16]   Campbell, D. T., & Fiske, D. W. (1959). Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix. Psychological Bulletin, 56, 81-105.

[17]   Carp, F. M. (1989). Maximizing Data Quality in Community Studies of Older People. In M. P. Lawton, & A. R. Herzog (Eds.), Special Research Methods for Gerontology (pp. 93-122). Amityville, NY: Baywood Publishing.

[18]   Chadha, N. K. (2009). Applied Psychometry. New Delhi, IN: Sage Publications.

[19]   Clauser, B. E. (2000). Recurrent Issues and Recent Advances in Scoring Performance Assessments. Applied Psychological Measurement, 24, 310-324.

[20]   Coolican, H. (2014). Research Methods and Statistics in Psychology (6th ed.). New York: Psychology Press.

[21]   Costa, P. T., & McCrae, R. R. (1995). Domains and Facets: Hierarchical Personality Assessment Using the Revised NEO Personality Inventory. Journal of Personality Assessment, 64, 21-50.

[22]   Coyle, T. R., & Pillow, D. R. (2008). SAT and ACT Predict College GPA after Removing g. Intelligence, 36, 719-729.

[23]   Crocker, L., & Algina, J. (1986). Introduction to Classical and Modern Test Theory. New York: Holt, Rinehart & Winston.

[24]   Cronbach, L. J. (1950). Further Evidence on Response Sets and Test Design. Educational and Psychological Measurement, 10, 3-31.

[25]   Cronbach, L. J. (1951). Coefficient Alpha and the Internal Structure of Tests. Psychometrika, 16, 297-334.

[26]   Dale, F., & Chall, J. E. (1948). A Formula for Predicting Readability: Instructions. Education Research Bulletin, 27, 37-54.

[27]   De Boeck, P., & Wilson, M. (2004). Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach. New York: Springer.

[28]   Demaio, T., & Landreth, A. (2004). Do Different Cognitive Interview Methods Produce Different Results. In S. Presser, J. Rothgeb, M. Couper, J. Lessler, E. Martin, J. Martin, & E. Singer (Eds.), Questionnaire Development and Testing Methods. Hoboken, NJ: Wiley.

[29]   DeVellis, R. F. (2017). Scale Development: Theory and Applications (4th ed.). Thousand Oaks, CA: Sage.

[30]   Dickinson, T. L., & Zellinger, P. M. (1980). A Comparison of the Behaviorally Anchored Rating Mixed Standard Scale Formats. Journal of Applied Psychology, 65, 147-154.

[31]   Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The Satisfaction with Life Scale. Journal of Personality Assessment, 49, 71-75.

[32]   Diener, E., Wirtz, D., Tov, W., Kim-Prieto, C., Choi, D. W., Oishi, S. et al. (2009). New Well-Being Measures: Short Scales to Assess Flourishing and Positive and Negative Feelings. Social Indicators Research, 97, 143-156.

[33]   Diener, E., Wirtz, D., Tov, W., Kim-Prieto, C., Choi, D., Oishi, S., & Biswas-Diener, R. (2010). New Wellbeing Measures: Short Scales to Assess Flourishing and Positive and Negative Feelings. Social Indicators Research, 97, 143-156.

[34]   Dimitrov, D. M. (2012). Statistical Methods for Validation of Assessment Scale Data in Counseling and Related Fields. Alexandria, VA: American Counseling Association.

[35]   Dorans, N. J. (2018). Scores, Scales, and Score Linking. In P. Irwing, T. Booth, & D. J. Hughes (Eds.), The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test Development, V.II (pp. 573-606). Hoboken, NJ: Wiley.

[36]   El Akremi, A., Gond, J.-P., Swaen, V., De Roeck, K., & Igalens, J. (2015). How Do Employees Perceive Corporate Responsibility? Development and Validation of a Multidimensional Corporate Stakeholder Responsibility Scale. Journal of Management, 44, 619-657.

[37]   Embretson, S. E., & Hershberger, S. L. (1999). Summary and Future of Psychometric Models in Testing. In S. E. Embretson, & S. L. Hershberger (Eds.), The New Rules of Measurement (pp. 243-254). Mahwah, NJ: Lawrence Erlbaum.

[38]   Embretson, S. E., & Reise, S. P. (2010). Item Response Theory (2nd ed.). New York, NY: Routledge Academic.

[39]   Fabrigar, L. R., & Ebel-Lam, A. (2007). Questionnaires. In N. J. Salkind (Ed.), Encyclopedia of Measurement and Statistics (pp. 808-812). Thousand Oaks, CA: Sage.

[40]   Fabrigar, L. R., & Wegener, D. T. (2012). Exploratory Factor Analysis. New York, NY: Oxford University Press, Inc.

[41]   Fredrickson, B. L. (1998). Cultivated Emotions: Parental Socialization of Positive Emotions and Self-Conscious Emotions. Psychological Inquiry, 9, 279-281.

[42]   Fredrickson, B. L. (2001). The Role of Positive Emotions in Positive Psychology: The Broaden-and-Build Theory of Positive Emotions. American Psychologist, 56, 218-226.

[43]   Fredrickson, B. L. (2003). The Value of Positive Emotions: The Emerging Science of Positive Psychology Is Coming to Understand Why It’s Good to Feel Good. American Scientist, 91, 330-335.

[44]   Fredrickson, B. L. (2013). Positive Emotions Broaden and Build. In Advances in Experimental Social Psychology (Vol. 47, pp. 1-53). Cambridge, MA: Academic Press.

[45]   Fry, E. (1977). Fry’s Readability Graph: Clarifications, Validity, and Extension to Level 17. Journal of Reading, 21, 249.

[46]   Furr, R. M. (2011). Scale Construction and Psychometrics for Social and Personality Psychology. New Delhi, IN: Sage Publications.

[47]   Gable, R. K., & Wolfe, M. B. (1993). Instrument Development in the Affective Domain: Measuring Attitudes and Values in Corporate and School Settings (2nd ed.). Boston, MA: Kluwer.

[48]   Green, P. E., & Rao, V. R. (1970). Rating Scales and Information Recovery—How Many Scales and Response Categories to Use? Journal of Marketing, 34, 33-39.

[49]   Guttman, L. (1941). The Quantification of a Class of Attributes: A Theory and Method for Scale Construction. In P. Horst (Ed.), The Prediction of Personal Adjustment (pp. 321-348). New York: Social Science Research Council.

[50]   Guttman, L. (1946). An Approach for Quantifying Paired Comparisons and Rank Order. Annals of Mathematical Statistics, 17, 144-163.

[51]   Guttman, L. A. (1944). A Basis for Scaling Qualitative Data. American Sociological Review, 9, 139-150.

[52]   Haladyna, T. M. (1999). Developing and Validating Multiple-choice Items (2nd ed.). Mahwah, NJ: Erlbaum.

[53]   Haladyna, T. M. (2004). Developing and Validating Multiple-Choice Test Items. Mahwah, NJ: Erlbaum.

[54]   Harter, S. (1982). The Perceived Competence Scale for Children. Child Development, 53, 87-97.

[55]   Hathaway, S. R., & McKinley, J. C. (1951). Manual for the Minnesota Multiphasic Personality Inventory (Rev. ed.). New York: Psychological Corporation.

[56]   Hawthorne, G., Mouthaan, J., Forbes, D., & Novaco, R. W. (2006). Response Categories and Anger Measurement: Do Fewer Categories Result in Poorer Measurement? Development of the DAR5. Social Psychiatry and Psychiatric Epidemiology, 41, 164-172.

[57]   Hayes, M. H. S., & Patterson, D. G. (1921). Experimental Development of the Graphic Rating Method. Psychological Bulletin, 18, 98-99.

[58]   Heise, D. R. (1970). Chapter 14. The Semantic Differential and Attitude Research. In G. F. Summers (Ed.), Attitude Measurement (pp. 235-253). Chicago, IL: Rand McNally.

[59]   Hoek, J. A., & Gendall, P. J. (1993). A New Method of Predicting Voting Behavior. International Journal of Market Research, 35, 1-14.

[60]   Howard, J., Gagné, M., Morin, A. J., & Van den Broeck, A. (2016). Motivation Profiles at Work: A Self-Determination Theory Approach. Journal of Vocational Behavior, 95-96, 74-89.

[61]   Huskisson, E. C. (1974). Measurement of Pain. The Lancet, 304, 1127-1131.

[62]   Irwing, P., & Hughes, D. J. (2018). Test Development. In P. Irwing, T. Booth, & D. J. Hughes (Eds.), The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test Development, V.I (pp. 4-47). Hoboken, NJ: Wiley.

[63]   Janda, L. H. (1998). Psychological Testing: Theory and Applications. Boston, MA: Allyn & Bacon.

[64]   Jenkins, G. D., & Taber, T. D. (1977). A Monte Carlo Study of Factors Affecting Three Indices of Composite Scale Reliability. Journal of Applied Psychology, 62, 392-398.

[65]   Jones, R. R. (1968). Differences in Response Consistency and Subjects’ Preferences for Three Personality Response Formats. In Proceedings of the 76th Annual Convention of the American Psychological Association (pp. 247-248) Washington DC.

[66]   Kline, R. B. (2009). Becoming a Behavioral Science Researcher: A Guide to Producing Research That Matters. New York: Guilford Publications.

[67]   Konrath, S., Meier, B. P., & Bushman, B. J. (2017). Development and Validation of the Single Item Trait Empathy Scale (SITES). Journal of Research in Personality, 73, 111-122.

[68]   Krosnick, J. A., & Presser, A. (2010). Question and Questionnaire Design. In P. V. Marsden, & J. D. Wright (Eds.), Handbook of Survey Research (2nd ed., pp. 264-313). Bingley, UK: Emerald.

[69]   Krosnick, J. A., & Schuman, H. (1988). Attitude Intensity, Importance, and Certainty and Susceptibility to Response Effects. Journal of Personality and Social Psychology, 54, 940-952.

[70]   Kyriazos, T. A. (2018a). Applied Psychometrics: The 3-Faced Construct Validation Method, a Routine for Evaluating a Factor Structure. Psychology, 9, 2044-2072.

[71]   Kyriazos, T. A. (2018b). Applied Psychometrics: Sample Size and Sample Power Considerations in Factor Analysis (EFA, CFA) and SEM in General. Psychology, 9, 2207-2230.

[72]   Lawshe, C. H. (1975). A Quantitative Approach to Content Validity. Personnel Psychology, 28, 563-575.

[73]   Lehmann, D. R., & Hulbert, J. (1972). Are Three-Point Scales Always Good Enough? Journal of Marketing Research, 9, 444-446.

[74]   Likert, R. (1932). A Technique for the Measurement of Attitudes. Archives of Psychology, 140, 1-55.

[75]   Likert, R. A. (1952). A Technique for the Development of Attitude Scales. Educational and Psychological Measurement, 12, 313-315.

[76]   Lindzey, G. G., & Guest, L. (1951). To Repeat—Check Lists Can Be Dangerous. Public Opinion Quarterly, 15, 355-358.

[77]   Lissitz, R. W., & Green, S. B. (1975). Effect of the Number of Scale Points on Reliability: A Monte Carlo Approach. Journal of Applied Psychology, 60, 10-13.

[78]   Lynn, M. R. (1986). Determination and Quantification of Content Validity. Nursing Research, 35, 382-386.

[79]   Lyubomirsky, S., & Lepper, H. S. (1999). A Measure of Subjective Happiness: Preliminary Reliability and Construct Validation. Social Indicators Research, 46, 137-155.

[80]   Martin, W. S. (1973). The Effects of Scaling on the Correlation Coefficient: A Test of Validity. Journal of Marketing Research, 10, 316-318.

[81]   Martin, W. S. (1978). Effects of Scaling on the Correlation Coefficient: Additional Considerations. Journal of Marketing Research, 15, 304-308.

[82]   McCullough, M. E., Emmons, R. A., & Tsang, J. (2002). The Grateful Disposition: A Conceptual and Empirical Topography. Journal of Personality and Social Psychology, 82, 112-127.

[83]   Milfont, T. L., & Fischer, R. (2010). Testing Measurement Invariance across Groups: Applications in Cross-Cultural Research. International Journal of Psychological Research, 3, 111-121.

[84]   Miller, G. A. (1956). The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. Psychological Review, 63, 81-97.

[85]   Morrison, K. M., & Embretson, S. (2018). Item Generation. In P. Irwing, T. Booth, & D. J. Hughes (Eds.), The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test Development, V.I (pp. 46-96). Hoboken, NJ: Wiley.

[86]   Muthén, L. K., & Muthén, B. O. (2009). Mplus Short Courses, Topic 1: Exploratory Factor Analysis, Confirmatory Factor Analysis, and Structural Equation Modeling for Continuous Outcomes. Los Angeles, CA: Muthén & Muthén.

[87]   Nering, M. L., & Ostini, R. (2010). Handbook of Polytomous Item Response Theory Models. New York: Routledge.

[88]   Netemeyer, R. G., Bearden, W. O., & Sharma, S. (2003). Scaling Procedures: Issues and Applications. Thousand Oaks, CA: Sage Publications.

[89]   Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric Theory (3rd ed.). New York: McGraw-Hill.

[90]   O’Muircheartaigh, C., Krosnick, J. A., & Helic, A. (1999). Middle Alternatives, Acquiescence, and the Quality of Questionnaire Data. Paper presented at the American Association for Public Opinion Research Annual Meeting, St. Petersburg, FL.

[91]   O’Muircheartaigh, C., Krosnick, J. A., & Helic, A. (2000). Middle Alternatives, Acquiescence, and the Quality of Questionnaire Data.

[92]   Osgood, C. E., & Tannenbaum, P. H. (1955). The Principle of Congruity in the Prediction of Attitude Change. Psychological Review, 62, 42-55.

[93]   Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1957). The Measurement of Meaning. Urbana, IL: University of Illinois Press.

[94]   Pavot, W. (2018). The Cornerstone of Research on Subjective Well-Being: Valid Assessment Methodology. In E. Diener, S. Oishi, & L. Tay (Eds.), Handbook of Well-Being. Salt Lake City, UT: DEF Publishers.

[95]   Presser, S., & Blair, J. (1994). Survey Pretesting: Do Different Methods Produce Different Results? In P. Marsden (Ed.), Sociological Methodology (Vol. 24, pp. 73-104). Washington, DC: American Sociological Association.

[96]   Price, L. R. (2017). Psychometric Methods: Theory into Practice. New York: The Guilford Press.

[97]   Prochaska, J. O., Norcross, J. C., Fowler, J., Follick, M. J., & Abrams, D. B. (1992). Attendance and Outcome in a Worksite Weight Control Program: Processes and Stages of Change as Process and Predictor Variables. Addictive Behaviors, 17, 35-45.

[98]   Ramsay, J. O. (1973). The Effect of Number of Categories in Rating Scales on Precision of Estimation of Scale Values. Psychometrika, 38, 513-532.

[99]   Raykov, T. (2012). Scale Construction and Development Using Structural Equation Modeling. In R. H. Hoyle (Ed.), Handbook of Structural Equation Modeling (pp. 472-492). New York: Guilford Press.

[100]   Reise, S. P., & Revicki, D. A. (2015). Handbook of Item Response Theory Modeling: Applications to Typical Performance Assessment. New York: Routledge.

[101]   Saris, W. E., & Gallhofer, I. N. (2007). Design, Evaluation, and Analysis of Questionnaires for Survey Research. Hoboken, NJ: Wiley.

[102]   Saville, P., & MacIver, R. (2017). A Very Good Question? In B. Cripps (Ed.), Psychometric Testing: Critical Perspectives (pp. 29-42). West Sussex, UK: John Wiley & Sons, Ltd.

[103]   Sawilowsky, S. S. (2007). Construct Validity. In N. J. Salkind (Ed.), Encyclopedia of Measurement and Statistics (pp. 178-180). Thousand Oaks, CA: Sage.

[104]   Schrank, F. A., McGrew, K. S., & Mather, N. (2014). Woodcock-Johnson IV Tests of Cognitive Abilities. Rolling Meadows, IL: Riverside.

[105]   Schuman, H., & Scott, J. (1987). Problems in the Use of Survey Questions to Measure Public Opinion. Science, 236, 957-959.

[106]   Schwarz, N. (1999). Self-Reports: How the Questions Shape the Answers. American Psychologist, 54, 93-105.

[107]   Schwarzer, R. (2001). Social-Cognitive Factors in Changing Health-Related Behavior. Current Directions in Psychological Science, 10, 47-51.

[108]   Scott, P. J., & Huskisson, E. C. (1978). Measurement of Functional Capacity with Visual Analog Scales. Rheumatology and Rehabilitation, 16, 257-259.

[109]   Seligman, M. E. (1998). What Is the Good Life? APA Monitor, 29, 2.

[110]   Seligman, M. E., & Csikszentmihalyi, M. (2000). Positive Psychology: An Introduction. American Psychologist, 55, 5-14.

[111]   Seligman, M. E., & Pawelski, J. O. (2003). Positive Psychology: FAQS. Psychological Inquiry, 14, 159-163.

[112]   Seligson, J. L., Huebner, E. S., & Valois, R. F. (2003). Preliminary Validation of the Brief Multidimensional Students’ Life Satisfaction Scale (BMSLSS). Social Indicators Research, 61, 121-145.

[113]   Singh, K., Junnarkar, M., & Kaur, J. (2016). Measures of Positive Psychology: Development and Validation. Berlin: Springer.

[114]   Slavec, A., & Drnovsek, M. (2012). A Perspective on Scale Development in Entrepreneurship. Economic and Business Review, 14, 39-62.

[115]   Smith, B. W., Dalen, J., Wiggins, K., Tooley, E., Christopher, P., & Bernard, J. (2008). The Brief Resilience Scale: Assessing the Ability to Bounce Back. International Journal of Behavioral Medicine, 15, 194-200.

[116]   Smith, M., & Smith, P. (2005). Testing People at Work: Competencies in Psychometric Testing. London: Blackwell.

[117]   Srinivasan, V., & Basu, A. K. (1989). The Metric Quality of Ordered Categorical Data. Marketing Science, 8, 205-230.

[118]   Steger, M. F., Frazier, P., Oishi, S., & Kaler, M. (2006). The Meaning in Life Questionnaire: Assessing the Presence of and Search for Meaning in Life. Journal of Counseling Psychology, 53, 80-93.

[119]   Streiner, D. L., Norman, G. R., & Cairney, J. (2015). Health Measurement Scales: A Practical Guide to Their Development and Use (5th ed.). Oxford, UK: Oxford University Press.

[120]   Sudman, S., & Bradburn, N. M. (1982). Asking Questions: A Practical Guide to Questionnaire Design. San Francisco, CA: Jossey-Bass.

[121]   Taylor, J. A. (1953). A Personality Scale of Manifest Anxiety. Journal of Abnormal and Social Psychology, 48, 285-290.

[122]   Thorndike, E. L. (1920). A Constant Error in Psychological Ratings. Journal of Applied Psychology, 4, 25-29.

[123]   Thurstone, L. L. (1927). Three Psychophysical Laws. Psychological Review, 34, 424-432.

[124]   Thurstone, L. L. (1928). Attitudes Can Be Measured. American Journal of Sociology, 33, 529-554.

[125]   Torgerson, W. (1958). Theory and Methods of Scaling. New York: Wiley.

[126]   Trochim, W. M. (2006). The Research Methods Knowledge Base (2nd ed.).

[127]   Waltz, C. W., & Bausell, R. B. (1981). Nursing Research: Design, Statistics and Computer Analysis. Philadelphia, PA: F.A. Davis.

[128]   Wechsler, D. (1958). The Measurement and Appraisal of Adult Intelligence (4th ed.). Baltimore, MD: Williams and Wilkins.

[129]   Willis, G., Schechter, S., & Whitaker, K. (2000). A Comparison of Cognitive Interviewing, Expert Review and Behavior Coding: What Do They Tell Us? In Proceedings of the Section on Survey Methods (pp. 28-37). Alexandria, VA: American Statistical Association.

[130]   Willms, D. G., & Johnson, N. A. (1993). Essentials in Qualitative Research: A Notebook for the Field. Unpublished Manuscript, Hamilton, ON: McMaster University.

[131]   Wilson, M. (2005). Constructing Measures: An Item Response Modeling Approach. Mahwah, NJ: Erlbaum.

[132]   Wolfe, E. W., & Smith Jr., E. V. (2007). Instrument Development Tools and Activities for Measure Validation Using Rasch Models: Part I Instrument Development Tools. Journal of Applied Measurement, 8, 97-123.

[133]   Wright, B. D., & Masters, G. N. (1982). Rating Scale Analysis. Chicago, IL: MESA Press.

[134]   Zumbo, B. D., Gelin, M. N., & Hubley, A. M. (2002). The Construction and Use of Psychological Tests and Measures. In Encyclopedia of Life Support Systems. France: United Nations Educational, Scientific, and Cultural Organization Publishing (UNESCO-EOLSS Publishing).