Practical Implications of Sum Scores Being Psychometrics’ Greatest Accomplishment

Published online by Cambridge University Press: 01 January 2025

Daniel McNeish*
Affiliation: Arizona State University
*Correspondence should be made to Daniel McNeish, Department of Psychology, Arizona State University, PO Box 871104, Tempe, AZ 85287, USA. Email: dmcneish@asu.edu

Abstract

This paper reflects on some practical implications of the excellent treatment of sum scoring and classical test theory (CTT) by Sijtsma et al. (Psychometrika 89(1):84–117, 2024). I have no major disagreements about the content they present and found it to be an informative clarification of the properties and possible extensions of CTT. In this paper, I focus on whether sum scores—despite their mathematical justification—are positioned to improve psychometric practice in empirical studies in psychology, education, and adjacent areas. First, I summarize recent reviews of psychometric practice in empirical studies, subsequent calls for greater psychometric transparency and validity, and how sum scores may or may not be positioned to adhere to such calls. Second, I consider limitations of sum scores for prediction, especially in the presence of common features like ordinal or Likert response scales, multidimensional constructs, and moderated or heterogeneous associations. Third, I review previous research outlining potential limitations of using sum scores as outcomes in subsequent analyses where rank ordering is not always sufficient to successfully characterize group differences or change over time. Fourth, I cover potential challenges for providing validity evidence for whether sum scores represent a single construct, particularly if one wishes to maintain minimal CTT assumptions. I conclude with thoughts about whether sum scores—even if mathematically justified—are positioned to improve psychometric practice in empirical studies.
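
For readers encountering these terms for the first time, the two objects at the center of the discussion can be stated compactly. What follows is a minimal notational sketch using standard definitions, not anything specific to this paper: given item scores $X_1, \dots, X_k$, the unit-weighted sum score is

\[ X_{+} = \sum_{i=1}^{k} X_i, \]

and classical test theory decomposes this observed score into a true score plus error,

\[ X_{+} = T + E, \qquad \mathbb{E}(E) = 0, \qquad \operatorname{Cov}(T, E) = 0, \]

with reliability defined as the proportion of observed-score variance attributable to the true score, $\rho = \sigma_T^2 / \sigma_{X_{+}}^2$.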

Type
Original Research
Copyright
© 2024 The Author(s), under exclusive licence to The Psychometric Society

References

Adjerid, I., & Kelley, K. (2018). Big data in psychology: A framework for research advancement. American Psychologist, 73(7), 899–917.
Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Doctoral training in statistics, measurement, and methodology in psychology: Replication and extension of Aiken, West, Sechrest, and Reno’s (1990) survey of PhD programs in North America. American Psychologist, 63(1), 32–50.
Alexandrova, A., & Haybron, D. M. (2016). Is construct validation valid? Philosophy of Science, 83(5), 1098–1109.
Altman, D. G., & Bland, J. M. (1983). Measurement in medicine: The analysis of method comparison studies. Journal of the Royal Statistical Society, Series D: The Statistician, 32(3), 307–317.
Angrist, J. D. (2004). American education research changes tack. Oxford Review of Economic Policy, 20(2), 198–212.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Aiken, L. R. (1980). Content validity and reliability of single items or questionnaires. Educational and Psychological Measurement, 40(4), 955–959.
Bauer, D. J. (2017). A more general model for testing measurement invariance and differential item functioning. Psychological Methods, 22(3), 507–526.
Beauducel, A., & Hilger, N. (2020). On the fit of models implied by unit-weighted scales. Communications in Statistics-Simulation and Computation, 49(11), 3054–3064.
Beauducel, A., & Leue, A. (2013). Unit-weighted scales imply models that should be tested! Practical Assessment, Research & Evaluation, 18(1), 1–7.
Beauducel, A. (2007). In spite of indeterminacy many common factor score estimates yield an identical reproduced covariance matrix. Psychometrika, 72(3), 437–441.
Bleidorn, W., & Hopwood, C. J. (2019). Using machine learning to advance personality assessment and theory. Personality and Social Psychology Review, 23(2), 190–203.
Borsboom, D., & Mellenbergh, G. J. (2004). Why psychometrics is not pathological: A comment on Michell. Theory & Psychology, 14(1), 105–120.
Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge University Press.
Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71(3), 425–440.
Blanchin, M., Hardouin, J. B., Neel, T. L., Kubis, G., Blanchard, C., Mirallié, E., & Sébille, V. (2011). Comparison of CTT and Rasch-based approaches for the analysis of longitudinal patient reported outcomes. Statistics in Medicine, 30(8), 825–838.
Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 1–16). Praeger.
Chinni, M. L., & Hubley, A. M. (2014). A research synthesis of validation practices used to evaluate the Satisfaction with Life Scale (SWLS). In B. D. Zumbo & E. K. Chan (Eds.), Validity and validation in social, behavioral, and health sciences (pp. 35–66). Springer.
Christensen, A. P., Golino, H., & Silvia, P. J. (2020). A psychometric network perspective on the validity and validation of personality trait questionnaires. European Journal of Personality, 34(6), 1095–1108.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312.
Collie, R. J., & Zumbo, B. D. (2014). Validity evidence in the Journal of Educational Psychology: Documenting current practice and a comparison with earlier practice. In B. D. Zumbo & E. K. Chan (Eds.), Validity and validation in social, behavioral, and health sciences (pp. 113–135). Springer.
Coxe, S., & Sibley, M. H. (2023). Harmonizing DSM-IV and DSM-5 versions of ADHD “A Criteria”: An item response theory analysis. Assessment, 30(3), 606–617.
Crutzen, R., & Peters, G. J. Y. (2017). Scale quality: Alpha is an inadequate estimate and factor-analytic evidence is needed first of all. Health Psychology Review, 11(3), 242–247.
Curran, P. J., Cole, V. T., Bauer, D. J., Rothenberg, W. A., & Hussong, A. M. (2018). Recovering predictor-criterion relations using covariate-informed factor score estimates. Structural Equation Modeling, 25(6), 860–875.
Curran, P. J., McGinley, J. S., Bauer, D. J., Hussong, A. M., Burns, A., Chassin, L., & Zucker, R. (2014). A moderated nonlinear factor model for the development of commensurate measures in integrative data analysis. Multivariate Behavioral Research, 49(3), 214–231.
Curran, P. J., Cole, V., Bauer, D. J., Hussong, A. M., & Gottfredson, N. (2016). Improving factor score estimation through the use of observed background characteristics. Structural Equation Modeling, 23(6), 827–844.
DiStefano, C., Zhu, M., & Mindrila, D. (2019). Understanding and using factor scores: Considerations for the applied researcher. Practical Assessment, Research, and Evaluation, 14(20), 1–11.
Donnellan, E., Usami, S., & Murayama, K. (2023). Random item slope regression: An alternative measurement model that accounts for both similarities and differences in association with individual items. Psychological Methods. Advance online publication.
Edwards, M. C., & Wirth, R. J. (2009). Measurement and the study of change. Research in Human Development, 6(2–3), 74–96.
Edwards, K. D., & Soland, J. (2024). How scoring approaches impact estimates of growth in the presence of survey item ceiling effects. Applied Psychological Measurement, 48(3), 147–164.
Embretson, S. E. (2007). Construct validity: A universal validity system or just another test evaluation procedure? Educational Researcher, 36(8), 449–455.
Embretson, S. E. (2004). The second century of ability testing: Some predictions and speculations. Measurement: Interdisciplinary Research and Perspectives, 2(1), 1–32.
Embretson, S. E. (1996). Item response theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20(3), 201–212.
Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197.
Epskamp, S., Rhemtulla, M., & Borsboom, D. (2017). Generalized network psychometrics: Combining network and latent variable models. Psychometrika, 82, 904–927.
Eronen, M. I., & Bringmann, L. F. (2021). The theory crisis in psychology: How to move forward. Perspectives on Psychological Science, 16(4), 779–788.
Evers, A., Lucassen, W., Meijer, R., & Sijtsma, K. (2015). COTAN review system for evaluating test quality. Retrieved February 19, 2024, from https://www.psynip.nl/wp-content/uploads/2022/05/COTAN-review-system-for-evaluating-test-quality.pdf
Evers, A. (2012). The internationalization of test reviewing: Trends, differences, and results. International Journal of Testing, 12(2), 136–156.
Evers, A., Sijtsma, K., Lucassen, W., & Meijer, R. R. (2010). The Dutch review process for evaluating the quality of psychological tests: History, procedure, and results. International Journal of Testing, 10(4), 295–317.
Flake, J. K., & Fried, E. I. (2020). Measurement schmeasurement: Questionable measurement practices and how to avoid them. Advances in Methods and Practices in Psychological Science, 3(4), 456–465.
Flake, J. K., Pek, J., & Hehman, E. (2017). Construct validation in social and personality research: Current practice and recommendations. Social Psychological and Personality Science, 8(4), 370–378.
Flake, J. K. (2021). Strengthening the foundation of educational psychology by integrating construct validation into open science reform. Educational Psychologist, 56(2), 132–141.
Flake, J. K., Davidson, I. J., Wong, O., & Pek, J. (2022). Construct validity and the validity of replication studies: A systematic review. American Psychologist, 77(4), 576–588.
Foster, G. C., Min, H., & Zickar, M. J. (2017). Review of item response theory practices in organizational research: Lessons learned and paths forward. Organizational Research Methods, 20(3), 465–486.
Fraley, R. C., Waller, N. G., & Brennan, K. A. (2000). An item response theory analysis of self-report measures of adult attachment. Journal of Personality and Social Psychology, 78(2), 350.
Fried, E. I. (2020). Theories and models: What they are, what they are for, and what they are about. Psychological Inquiry, 31(4), 336–344.
Fried, E. I. (2015). Problematic assumptions have slowed down depression research: Why symptoms, not syndromes are the way forward. Frontiers in Psychology, 6, 309.
Fried, E. I., & Nesse, R. M. (2015). Depression sum-scores don’t add up: Why analyzing specific depression symptoms is essential. BMC Medicine, 13(1), 1–11.
Fried, E. I., & Nesse, R. M. (2014). The impact of individual depressive symptoms on impairment of psychosocial functioning. PLoS ONE, 9(2).
Gonzalez, O. (2021). Psychometric and machine learning approaches for diagnostic assessment and tests of individual classification. Psychological Methods, 26(2), 236–254.
Gonzalez, O., MacKinnon, D. P., & Muniz, F. B. (2021). Extrinsic convergent validity evidence to prevent jingle and jangle fallacies. Multivariate Behavioral Research, 56(1), 3–19.
Gorter, R., Fox, J. P., Riet, G. T., Heymans, M. W., & Twisk, J. W. R. (2020). Latent growth modeling of IRT versus CTT measured longitudinal latent variables. Statistical Methods in Medical Research, 29(4), 962–986.
Gorter, R., Fox, J. P., Apeldoorn, A., & Twisk, J. (2016). Measurement model choice influenced randomized controlled trial results. Journal of Clinical Epidemiology, 79, 140–149.
Gottfredson, N. C., Cole, V. T., Giordano, M. L., Bauer, D. J., Hussong, A. M., & Ennett, S. T. (2019). Simplifying the implementation of modern scale scoring methods with an automated R package: Automated moderated nonlinear factor analysis (aMNLFA). Addictive Behaviors, 94, 65–73.
Grice, J. W., & Harris, R. J. (1998). A comparison of regression and loading weights for the computation of factor scores. Multivariate Behavioral Research, 33(2), 221–247.
Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4), 430–450.
Gunnell, K. E., Schellenberg, B. J., Wilson, P. M., Crocker, P. R., Mack, D. E., & Zumbo, B. D. (2014). A review of validity evidence presented in the Journal of Sport and Exercise Psychology (2002–2012): Misconceptions and recommendations for validation research. In B. D. Zumbo & E. K. Chan (Eds.), Validity and validation in social, behavioral, and health sciences (pp. 137–156). Springer.
Hair, J. F., Sharma, P. N., Sarstedt, M., Ringle, C. M., & Liengaard, B. D. (2024). The shortcomings of equal weights estimation and the composite equivalence index in PLS-SEM. European Journal of Marketing, 58(13), 30–55.
Hancock, G. R., & Mueller, R. O. (2001). Rethinking construct reliability within latent variable systems. In R. Cudeck, S. du Toit, & D. Sorbom (Eds.), Structural equation modeling: Present and future—A festschrift in honor of Karl Joreskog (pp. 195–216). Scientific Software International.
Hemker, B. T., Sijtsma, K., Molenaar, I. W., & Junker, B. W. (1996). Polytomous IRT models and monotone likelihood ratio of the total score. Psychometrika, 61(4), 679–693.
Hemker, B. T., Sijtsma, K., Molenaar, I. W., & Junker, B. W. (1997). Stochastic ordering using the latent trait and the sum score in polytomous IRT models. Psychometrika, 62(3), 331–347.
Higgins, W. C., Kaplan, D. M., Deschrijver, E., & Ross, R. M. (2023). Construct validity evidence reporting practices for the Reading the Mind in the Eyes Test: A systematic scoping review. Clinical Psychology Review, 108.
Hogan, T. P., & Agnello, J. (2004). An empirical study of reporting practices concerning measurement validity. Educational and Psychological Measurement, 64(5), 802–812.
Hopwood, C. J., & Donnellan, M. B. (2010). How should the internal structure of personality inventories be evaluated? Personality and Social Psychology Review, 14(3), 332–346.
Howard, A. L. (2024). Graduate students need more quantitative methods support. Nature Reviews Psychology, 3, 140–141.
Hsiao, Y. Y., Kwok, O. M., & Lai, M. H. (2018). Evaluation of two methods for modeling measurement errors when testing interaction effects with observed composite scores. Educational and Psychological Measurement, 78(2), 181–202.
Huang, P. H. (2022). Penalized least squares for structural equation modeling with ordinal responses. Multivariate Behavioral Research, 57(2–3), 279–297.
Hubley, A. M., Zhu, S. M., Sasaki, A., & Gadermann, A. M. (2014). Synthesis of validation practices in two assessment journals: Psychological Assessment and the European Journal of Psychological Assessment. In B. D. Zumbo & E. K. Chan (Eds.), Validity and validation in social, behavioral, and health sciences (pp. 193–213). Springer.
Hussong, A. M., Gottfredson, N. C., Bauer, D. J., Curran, P. J., Haroon, M., Chandler, R., & Springer, S. A. (2019). Approaches for creating comparable measures of alcohol use symptoms: Harmonization with eight studies of criminal justice populations. Drug and Alcohol Dependence, 194, 59–68.
Hwang, H., Cho, G., Jung, K., Falk, C. F., Flake, J. K., Jin, M. J., & Lee, S. H. (2021). An approach to structural equation modeling with both factors and components: Integrated generalized structured component analysis. Psychological Methods, 26(3), 273–294.
Jackson, D. L., Gillaspy, J. A., Jr., & Purc-Stephenson, R. (2009). Reporting practices in confirmatory factor analysis: An overview and some recommendations. Psychological Methods, 14(1), 6–23.
Jacobucci, R., & Grimm, K. J. (2020). Machine learning and psychological research: The unexplored effect of measurement. Perspectives on Psychological Science, 15(3), 809–816.
Jacobucci, R., Grimm, K. J., & McArdle, J. J. (2016). Regularized structural equation modeling. Structural Equation Modeling, 23(4), 555–566.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). American Council on Education/Praeger.
Kang, S. M., & Waller, N. G. (2005). Moderated multiple regression, spurious interaction effects, and IRT. Applied Psychological Measurement, 29(2), 87–105.
Kessels, R., Moerbeek, M., Bloemers, J., & van Der Heijden, P. G. (2021). A multilevel structural equation model for assessing a drug effect on a patient-reported outcome measure in on-demand medication data. Biometrical Journal, 63(8), 1652–1672.
Kim, S. (2006). A comparative study of IRT fixed parameter calibration methods. Journal of Educational Measurement, 43(4), 355–381.
König, C., Khorramdel, L., Yamamoto, K., & Frey, A. (2021). The benefits of fixed item parameter calibration for parameter accuracy in small sample situations in large-scale assessments. Educational Measurement: Issues and Practice, 40(1), 17–27.
Kuhfeld, M., & Soland, J. (2022). Avoiding bias from sum scores in growth estimates: An examination of IRT-based approaches to scoring longitudinal survey responses. Psychological Methods, 27(2), 234–260.
Kuhfeld, M., & Soland, J. (2023). Scoring assessments in multisite randomized control trials: Examining the sensitivity of treatment effect estimates to measurement choices. Psychological Methods. Advance online publication.
Li, H., Rosenthal, R., & Rubin, D. B. (1996). Reliability of measurement in psychology: From Spearman–Brown to maximal reliability. Psychological Methods, 1(1), 98–107.
Li, X., & Jacobucci, R. (2022). Regularized structural equation modeling with stability selection. Psychological Methods, 27(4), 497–518.
Liang, X., & Jacobucci, R. (2020). Regularized structural equation modeling to detect measurement bias: Evaluation of lasso, adaptive lasso, and elastic net. Structural Equation Modeling, 27(5), 722–734.
Liu, Q., & Wang, L. (2021). t-Test and ANOVA for data with ceiling and/or floor effects. Behavior Research Methods, 53(1), 264–277.
Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584–585.
Luningham, J. M., McArtor, D. B., Bartels, M., Boomsma, D. I., & Lubke, G. H. (2017). Sum scores in twin growth curve models: Practicality versus bias. Behavior Genetics, 47, 516–536.
Maassen, E., D’Urso, E. D., van Assen, M. A., Nuijten, M. B., De Roover, K., & Wicherts, J. M. (2024). The dire disregard of measurement invariance testing in psychological science. Psychological Methods. Advance online publication.
Maxwell, S. E., & Delaney, H. D. (1985). Measurement and statistics: An examination of construct validity. Psychological Bulletin, 97(1), 85–93.
McClure, K., Ammerman, B. A., & Jacobucci, R. (2024). On the selection of item scores or composite scores for clinical prediction. Multivariate Behavioral Research, 59(3), 566–583.
McNeish, D., & Wolf, M. G. (2020). Thinking twice about sum scores. Behavior Research Methods, 52, 2287–2305.
McNeish, D. (2023). Psychometric properties of sum scores and factor scores differ even when their correlation is 0.98: A response to Widaman and Revelle. Behavior Research Methods, 55(8), 4269–4290.
McNeish, D. (2023). Generalizability of dynamic fit index, equivalence testing, and Hu & Bentler cutoffs for evaluating fit in factor analysis. Multivariate Behavioral Research, 58(1), 195–219.
McNeish, D. (2023). Dynamic fit index cutoffs for categorical factor analysis with Likert-type, ordinal, or binary responses. American Psychologist, 78(9), 1061–1075.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2002). On the role of task model variables in assessment design. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 97–128). Lawrence Erlbaum.
Mislevy, R. J., Almond, R. G., & Lukas, J. F. (2003). A brief introduction to evidence-centered design. ETS Research Report Series, 2003(1), i–29.
Morgan-López, A. A., Saavedra, L. M., Hien, D. A., Norman, S. B., Fitzpatrick, S. S., Ye, A., & Back, S. E. (2023). Differential symptom weighting in estimating empirical thresholds for underlying PTSD severity: Toward a “platinum” standard for diagnosis? International Journal of Methods in Psychiatric Research, 32(3).
Morse, B. J., Johanson, G. A., & Griffeth, R. W. (2012). Using the graded response model to control spurious interactions in moderated multiple regression. Applied Psychological Measurement, 36(2), 122–146.
Müller, S., Hopwood, C. J., Skodol, A. E., Morey, L. C., Oltmanns, T. F., Benecke, C., & Zimmermann, J. (2023). Exploring the predictive validity of personality disorder criteria. Personality Disorders: Theory, Research, and Treatment, 14(3), 309–320.
Murray, A. L., Molenaar, D., Johnson, W., & Krueger, R. F. (2016). Dependence of gene-by-environment interactions (GxE) on scaling: Comparing the use of sum scores, transformed sum scores and IRT scores for the phenotype in tests of GxE. Behavior Genetics, 46, 552–572.
Nosek, B. A., Hardwicke, T. E., Moshontz, H., Allard, A., Corker, K. S., Dreber, A., & Vazire, S. (2022). Replicability, robustness, and reproducibility in psychological science. Annual Review of Psychology, 73, 719–748.
Padilla García, J. L., & Benítez Baena, I. (2014). Validity evidence based on response processes. Psicothema, 26(1), 136–144.
Pashler, H., & Wagenmakers, E. J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7(6), 528–530.
Pelt, D. H., Schwabe, I., & Bartels, M. (2023). Bias in gene-by-environment interaction effects with sum scores: An application to well-being phenotypes. Behavior Genetics, 53, 359–373.
Peters, G. J., & Crutzen, R. (2024). Knowing what we’re talking about: Facilitating decentralized, unequivocal publication of and reference to psychological construct definitions and instructions. Meta-Psychology, 8, 1–27.
Proust-Lima, C., Philipps, V., Dartigues, J. F., Bennett, D. A., Glymour, M. M., Jacqmin-Gadda, H., & Samieri, C. (2019). Are latent variable models preferable to composite score approaches when assessing risk factors of change? Evaluation of type-I error and statistical power in longitudinal cognitive studies. Statistical Methods in Medical Research, 28(7), 1942–1957.
Proust-Lima, C., Dartigues, J. F., & Jacqmin-Gadda, H. (2011). Misuse of the linear mixed model when evaluating risk factors of cognitive decline. American Journal of Epidemiology, 174(9), 1077–1088.
Pruzek, R. M., & Frederick, B. C. (1978). Weighting predictors in linear models: Alternatives to least squares and limitations of equal weights. Psychological Bulletin, 85(2), 254–266.
Qualls, A. L., & Moss, A. D. (1996). The degree of congruence between test standards and test documentation within journal publications. Educational and Psychological Measurement, 56(2), 209–214.
Ramsay, J. O., & Wiberg, M. (2017). A strategy for replacing sum scoring. Journal of Educational and Behavioral Statistics, 42(3), 282–307.
Reise, S. P., & Henson, J. M. (2003). A discussion of modern versus traditional psychometrics as applied to personality assessment scales. Journal of Personality Assessment, 81(2), 93–103.
Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27–48.
Revelle, W. (2024). The seductive beauty of latent variable models: Or why I don’t believe in the Easter Bunny. Personality and Individual Differences, 221.
Rhemtulla, M., van Bork, R., & Borsboom, D. (2020). Worse than measurement error: Consequences of inappropriate latent variable measurement models. Psychological Methods, 25(1), 30–45.
Rodgers, J. L., & Shrout, P. E. (2018). Psychology’s replication crisis as scientific opportunity: A précis for policymakers. Policy Insights from the Behavioral and Brain Sciences, 5(1), 134–141.
Rose, N., Wagner, W., Mayer, A., & Nagengast, B. (2019). Model-based manifest and latent composite scores in structural equation models. Collabra: Psychology, 5(1), 9.
Russell, D. W. (2002). In search of underlying dimensions: The use (and abuse) of factor analysis in Personality and Social Psychology Bulletin. Personality and Social Psychology Bulletin, 28(12), 1629–1646.
Schwabe, I., & van den Berg, S. M. (2014). Assessing genotype by environment interaction in case of heterogeneous measurement error. Behavior Genetics, 44(4), 394–406.
Schimmack, U. (2021). The validation crisis in psychology. Meta-Psychology, 5, 1–9.
Shaw, M., Cloos, L. J., Luong, R., Elbaz, S., & Flake, J. K. (2020). Measurement practices in large-scale replications: Insights from Many Labs 2. Canadian Psychology/Psychologie Canadienne, 61(4), 289.
Schreiber, J. B. (2021). Issues and recommendations for exploratory factor analysis and principal component analysis. Research in Social and Administrative Pharmacy, 17(5), 1004–1011.
Shear, B. R., & Zumbo, B. D. (2014). What counts as evidence: A review of validity studies in educational and psychological measurement. In B. D. Zumbo & E. K. Chan (Eds.), Validity and validation in social, behavioral, and health sciences (pp. 91–111). Springer.
Sijtsma, K. (2012). Future of psychometrics: Ask what psychometrics can do for psychology. Psychometrika, 77, 4–20.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74, 107–120.
Sijtsma, K., Ellis, J. L., & Borsboom, D. (2024). Recognize the value of the sum score, psychometrics’ greatest accomplishment. Psychometrika, 89(1), 84–117.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
Sireci, S., & Faulkner-Bond, M. (2014). Validity evidence based on test content. Psicothema, 26(1), 100–107.
Sireci, S. G. (1998). The construct of content validity. Social Indicators Research, 45, 83–117.
Sireci, S. G. (1998). Gathering and analyzing content validity data. Educational Assessment, 5(4), 299–321.
Slof-Op’t Landt, M. C. T., van Furth, E. F., Rebollo-Mesa, I., Bartels, M., van Beijsterveldt, C. E. M., Slagboom, P. E., & Dolan, C. V. (2009). Sex differences in sum scores may be hard to interpret: The importance of measurement invariance. Assessment, 16(4), 415–423.
Soland, J. (2022). Evidence that selecting an appropriate item response theory-based approach to scoring surveys can help avoid biased treatment effect estimates. Educational and Psychological Measurement, 82(2), 376–403.
Soland, J., Kuhfeld, M., & Edwards, K. (2022a). How survey scoring decisions can influence your study’s results: A trip through the IRT looking glass. Psychological Methods. Advance online publication.
Soland, J., McGinty, A., Gray, A., Solari, E. J., Herring, W., & Xu, R. (2022). Early literacy, equity, and test score comparability during the pandemic. Educational Assessment, 27(2), 98–114.
Soland, J., Johnson, A., & Talbert, E. (2023). Regression discontinuity designs in a latent variable framework. Psychological Methods, 28(3), 691–704.
Soland, J., Cole, V., Tavares, S., & Zhang, Q. (2024). Evidence that growth mixture model results are highly sensitive to scoring decisions. PsyArXiv. https://osf.io/preprints/psyarxiv/d27rc
Speelman, C. P., Parker, L., Rapley, B. J., & McGann, M. (2024). Most psychological researchers assume their samples are ergodic: Evidence from a year of articles in three major journals. Collabra: Psychology, 10(1), 92888.
Stochl, J., Fried, E. I., Fritz, J., Croudace, T. J., Russo, D. A., Knight, C., & Perez, J. (2022). On dimensionality, measurement invariance, and suitability of sum scores for the PHQ-9 and the GAD-7. Assessment, 29(3), 355–366.
Tackett, J. L., Brandes, C. M., King, K. M., & Markon, K. E. (2019). Psychology’s replication crisis and clinical psychological science. Annual Review of Clinical Psychology, 15, 579–604.
Tang, X., Schalet, B. D., Peipert, J. D., & Cella, D. (2023). Does scoring method impact estimation of significant individual changes assessed by patient-reported outcome measures? Comparing classical test theory versus item response theory. Value in Health, 23(10), 1518–1524.
Tay, L., Woo, S. E., Hickman, L., & Saef, R. M. (2020). Psychometric and validity issues in machine learning approaches to personality assessment: A focus on social media text mining. European Journal of Personality, 34(5), 826–844.
Thissen, D., Steinberg, L., Pyszczynski, T., & Greenberg, J. (1983). An item response theory for personality and attitude scales: Item analysis using restricted factor analysis. Applied Psychological Measurement, 7(2), 211–226.
Tsutakawa, R. K., & Johnson, J. C. (1990). The effect of uncertainty of item parameter estimation on ability estimates. Psychometrika, 55(2), 371–390.
van den Oord, E. J., Pickles, A., & Waldman, I. D. (2003). Normal variation and abnormality: An empirical study of the liability distributions underlying depression and delinquency. Journal of Child Psychology and Psychiatry, 44(2), 180–192.
van den Oord, E. J., & van der Ark, L. A. (1997). A note on the use of the Tobit approach for tests scores with floor or ceiling effects. British Journal of Mathematical and Statistical Psychology, 50(2), 351–364.
van der Ark, L. A. (2005). Stochastic ordering of the latent trait by the sum score under various polytomous IRT models. Psychometrika, 70, 283–304.
Vogelsmeier, L. V., Jongerling, J., & Maassen, E. (2024). Assessing and accounting for measurement in intensive longitudinal studies: Current practices, considerations, and avenues for improvement. Quality of Life Research. Advance online publication.
Vogelsmeier, L. V., Vermunt, J. K., Keijsers, L., & De Roover, K. (2021). Latent Markov latent trait analysis for exploring measurement model changes in intensive longitudinal data. Evaluation & the Health Professions, 44(1), 61–76.
Vogelsmeier, L. V., Vermunt, J. K., van Roekel, E., & De Roover, K. (2019). Latent Markov factor analysis for exploring measurement model changes in time-intensive longitudinal studies. Structural Equation Modeling, 26(4), 557–575.
Wainer, H. (1976). Estimating coefficients in linear models: It don’t make no nevermind. Psychological Bulletin, 83(2), 213–217.
Weidman, A. C., Steckler, C. M., & Tracy, J. L. (2017). The jingle and jangle of emotion assessment: Imprecise measurement, casual scale usage, and conceptual fuzziness in emotion research. Emotion, 17(2), 267–295.
Wicherts, J. M., Veldkamp, C. L., Augusteijn, H. E., Bakker, M., van Aert, R., & van Assen, M. A. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7, 1832.
Wilson, M., Allen, D. D., & Li, J. C. (2006). Improving measurement in health education and health behavior research using item response modeling: Comparison with the classical test theory approach. Health Education Research, 21(Suppl. 1), i19–i32.
Wolf, M. G. (2023). The problem with over-relying on quantitative evidence of validity. PsyArXiv. https://doi.org/10.31234/osf.io/v4nb2
Zwitser, R. J., & Maris, G. (2016). Ordering individuals with sum scores: The introduction of the nonparametric Rasch model. Psychometrika, 81, 39–59.