Practical Implications of Sum Scores Being Psychometrics’ Greatest Accomplishment

Published online by Cambridge University Press: 01 January 2025

Daniel McNeish*
Affiliation: Arizona State University
*Correspondence should be made to Daniel McNeish, Department of Psychology, Arizona State University, PO Box 871104, Tempe, AZ 85287, USA. Email: dmcneish@asu.edu

Abstract

This paper reflects on some practical implications of the excellent treatment of sum scoring and classical test theory (CTT) by Sijtsma et al. (Psychometrika 89(1):84–117, 2024). I have no major disagreements about the content they present and found it to be an informative clarification of the properties and possible extensions of CTT. In this paper, I focus on whether sum scores—despite their mathematical justification—are positioned to improve psychometric practice in empirical studies in psychology, education, and adjacent areas. First, I summarize recent reviews of psychometric practice in empirical studies, subsequent calls for greater psychometric transparency and validity, and how sum scores may or may not be positioned to adhere to such calls. Second, I consider limitations of sum scores for prediction, especially in the presence of common features like ordinal or Likert response scales, multidimensional constructs, and moderated or heterogeneous associations. Third, I review previous research outlining potential limitations of using sum scores as outcomes in subsequent analyses where rank ordering is not always sufficient to successfully characterize group differences or change over time. Fourth, I cover potential challenges for providing validity evidence for whether sum scores represent a single construct, particularly if one wishes to maintain minimal CTT assumptions. I conclude with thoughts about whether sum scores—even if mathematically justified—are positioned to improve psychometric practice in empirical studies.
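
For readers encountering these terms for the first time, the two objects at the center of the discussion can be stated compactly. What follows is a minimal notational sketch using standard definitions, not anything specific to this paper: given item scores $X_1, \dots, X_k$, the unit-weighted sum score is

\[ X_{+} = \sum_{i=1}^{k} X_i, \]

and classical test theory decomposes this observed score into a true score plus error,

\[ X_{+} = T + E, \qquad \mathbb{E}(E) = 0, \qquad \operatorname{Cov}(T, E) = 0, \]

with reliability defined as the proportion of observed-score variance attributable to the true score, $\rho = \sigma_T^2 / \sigma_{X_{+}}^2$.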

Type
Original Research
Copyright
© 2024 The Author(s), under exclusive licence to The Psychometric Society

References

Adjerid, I., & Kelley, K. (2018). Big data in psychology: A framework for research advancement. American Psychologist, 73(7), 899–917.
Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Doctoral training in statistics, measurement, and methodology in psychology: Replication and extension of Aiken, West, Sechrest, and Reno’s (1990) survey of PhD programs in North America. American Psychologist, 63(1), 32–50.
Alexandrova, A., & Haybron, D. M. (2016). Is construct validation valid? Philosophy of Science, 83(5), 1098–1109.
Altman, D. G., & Bland, J. M. (1983). Measurement in medicine: The analysis of method comparison studies. Journal of the Royal Statistical Society, Series D: The Statistician, 32(3), 307–317.
Angrist, J. D. (2004). American education research changes tack. Oxford Review of Economic Policy, 20(2), 198–212.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Aiken, L. R. (1980). Content validity and reliability of single items or questionnaires. Educational and Psychological Measurement, 40(4), 955–959.
Bauer, D. J. (2017). A more general model for testing measurement invariance and differential item functioning. Psychological Methods, 22(3), 507–526.
Beauducel, A., & Hilger, N. (2020). On the fit of models implied by unit-weighted scales. Communications in Statistics-Simulation and Computation, 49(11), 3054–3064.
Beauducel, A., & Leue, A. (2013). Unit-weighted scales imply models that should be tested! Practical Assessment, Research & Evaluation, 18(1), 1–7.
Beauducel, A. (2007). In spite of indeterminacy many common factor score estimates yield an identical reproduced covariance matrix. Psychometrika, 72(3), 437–441.
Bleidorn, W., & Hopwood, C. J. (2019). Using machine learning to advance personality assessment and theory. Personality and Social Psychology Review, 23(2), 190–203.
Borsboom, D., & Mellenbergh, G. J. (2004). Why psychometrics is not pathological: A comment on Michell. Theory & Psychology, 14(1), 105–120.
Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge University Press.
Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71(3), 425–440.
Blanchin, M., Hardouin, J. B., Neel, T. L., Kubis, G., Blanchard, C., Mirallié, E., & Sébille, V. (2011). Comparison of CTT and Rasch-based approaches for the analysis of longitudinal patient reported outcomes. Statistics in Medicine, 30(8), 825–838.
Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 1–16). Praeger.
Chinni, M. L., & Hubley, A. M. (2014). A research synthesis of validation practices used to evaluate the Satisfaction with Life Scale (SWLS). In B. D. Zumbo & E. K. Chan (Eds.), Validity and validation in social, behavioral, and health sciences (pp. 35–66). Springer.
Christensen, A. P., Golino, H., & Silvia, P. J. (2020). A psychometric network perspective on the validity and validation of personality trait questionnaires. European Journal of Personality, 34(6), 1095–1108.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312.
Collie, R. J., & Zumbo, B. D. (2014). Validity evidence in the Journal of Educational Psychology: Documenting current practice and a comparison with earlier practice. In B. D. Zumbo & E. K. Chan (Eds.), Validity and validation in social, behavioral, and health sciences (pp. 113–135). Springer.
Coxe, S., & Sibley, M. H. (2023). Harmonizing DSM-IV and DSM-5 versions of ADHD “A Criteria”: An item response theory analysis. Assessment, 30(3), 606–617.
Crutzen, R., & Peters, G. J. Y. (2017). Scale quality: Alpha is an inadequate estimate and factor-analytic evidence is needed first of all. Health Psychology Review, 11(3), 242–247.
Curran, P. J., Cole, V. T., Bauer, D. J., Rothenberg, W. A., & Hussong, A. M. (2018). Recovering predictor-criterion relations using covariate-informed factor score estimates. Structural Equation Modeling, 25(6), 860–875.
Curran, P. J., McGinley, J. S., Bauer, D. J., Hussong, A. M., Burns, A., Chassin, L., & Zucker, R. (2014). A moderated nonlinear factor model for the development of commensurate measures in integrative data analysis. Multivariate Behavioral Research, 49(3), 214–231.
Curran, P. J., Cole, V., Bauer, D. J., Hussong, A. M., & Gottfredson, N. (2016). Improving factor score estimation through the use of observed background characteristics. Structural Equation Modeling, 23(6), 827–844.
DiStefano, C., Zhu, M., & Mindrila, D. (2019). Understanding and using factor scores: Considerations for the applied researcher. Practical Assessment, Research, and Evaluation, 14(20), 1–11.
Donnellan, E., Usami, S., & Murayama, K. (2023). Random item slope regression: An alternative measurement model that accounts for both similarities and differences in association with individual items. Psychological Methods. Advance online publication.
Edwards, M. C., & Wirth, R. J. (2009). Measurement and the study of change. Research in Human Development, 6(2–3), 74–96.
Edwards, K. D., & Soland, J. (2024). How scoring approaches impact estimates of growth in the presence of survey item ceiling effects. Applied Psychological Measurement, 48(3), 147–164.
Embretson, S. E. (2007). Construct validity: A universal validity system or just another test evaluation procedure? Educational Researcher, 36(8), 449–455.
Embretson, S. E. (2004). The second century of ability testing: Some predictions and speculations. Measurement: Interdisciplinary Research and Perspectives, 2(1), 1–32.
Embretson, S. E. (1996). Item response theory models and spurious interaction effects in factorial ANOVA designs. Applied Psychological Measurement, 20(3), 201–212.
Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197.
Epskamp, S., Rhemtulla, M., & Borsboom, D. (2017). Generalized network psychometrics: Combining network and latent variable models. Psychometrika, 82, 904–927.
Eronen, M. I., & Bringmann, L. F. (2021). The theory crisis in psychology: How to move forward. Perspectives on Psychological Science, 16(4), 779–788.
Evers, A., Lucassen, W., Meijer, R., & Sijtsma, K. (2015). COTAN review system for evaluating test quality. Retrieved February 19, 2024, from https://www.psynip.nl/wp-content/uploads/2022/05/COTAN-review-system-for-evaluating-test-quality.pdf
Evers, A. (2012). The internationalization of test reviewing: Trends, differences, and results. International Journal of Testing, 12(2), 136–156.
Evers, A., Sijtsma, K., Lucassen, W., & Meijer, R. R. (2010). The Dutch review process for evaluating the quality of psychological tests: History, procedure, and results. International Journal of Testing, 10(4), 295–317.
Flake, J. K., & Fried, E. I. (2020). Measurement schmeasurement: Questionable measurement practices and how to avoid them. Advances in Methods and Practices in Psychological Science, 3(4), 456–465.
Flake, J. K., Pek, J., & Hehman, E. (2017). Construct validation in social and personality research: Current practice and recommendations. Social Psychological and Personality Science, 8(4), 370–378.
Flake, J. K. (2021). Strengthening the foundation of educational psychology by integrating construct validation into open science reform. Educational Psychologist, 56(2), 132–141.
Flake, J. K., Davidson, I. J., Wong, O., & Pek, J. (2022). Construct validity and the validity of replication studies: A systematic review. American Psychologist, 77(4), 576–588.
Foster, G. C., Min, H., & Zickar, M. J. (2017). Review of item response theory practices in organizational research: Lessons learned and paths forward. Organizational Research Methods, 20(3), 465–486.
Fraley, R. C., Waller, N. G., & Brennan, K. A. (2000). An item response theory analysis of self-report measures of adult attachment. Journal of Personality and Social Psychology, 78(2), 350.
Fried, E. I. (2020). Theories and models: What they are, what they are for, and what they are about. Psychological Inquiry, 31(4), 336–344.
Fried, E. I. (2015). Problematic assumptions have slowed down depression research: Why symptoms, not syndromes are the way forward. Frontiers in Psychology, 6, 309.
Fried, E. I., & Nesse, R. M. (2015). Depression sum-scores don’t add up: Why analyzing specific depression symptoms is essential. BMC Medicine, 13(1), 1–11.
Fried, E. I., & Nesse, R. M. (2014). The impact of individual depressive symptoms on impairment of psychosocial functioning. PLoS ONE, 9(2).
Gonzalez, O. (2021). Psychometric and machine learning approaches for diagnostic assessment and tests of individual classification. Psychological Methods, 26(2), 236–254.
Gonzalez, O., MacKinnon, D. P., & Muniz, F. B. (2021). Extrinsic convergent validity evidence to prevent jingle and jangle fallacies. Multivariate Behavioral Research, 56(1), 3–19.
Gorter, R., Fox, J. P., Riet, G. T., Heymans, M. W., & Twisk, J. W. R. (2020). Latent growth modeling of IRT versus CTT measured longitudinal latent variables. Statistical Methods in Medical Research, 29(4), 962–986.
Gorter, R., Fox, J. P., Apeldoorn, A., & Twisk, J. (2016). Measurement model choice influenced randomized controlled trial results. Journal of Clinical Epidemiology, 79, 140–149.
Gottfredson, N. C., Cole, V. T., Giordano, M. L., Bauer, D. J., Hussong, A. M., & Ennett, S. T. (2019). Simplifying the implementation of modern scale scoring methods with an automated R package: Automated moderated nonlinear factor analysis (aMNLFA). Addictive Behaviors, 94, 65–73.
Grice, J. W., & Harris, R. J. (1998). A comparison of regression and loading weights for the computation of factor scores. Multivariate Behavioral Research, 33(2), 221–247.
Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4), 430–450.
Gunnell, K. E., Schellenberg, B. J., Wilson, P. M., Crocker, P. R., Mack, D. E., & Zumbo, B. D. (2014). A review of validity evidence presented in the Journal of Sport and Exercise Psychology (2002–2012): Misconceptions and recommendations for validation research. In B. D. Zumbo & E. K. Chan (Eds.), Validity and validation in social, behavioral, and health sciences (pp. 137–156). Springer.
Hair, J. F., Sharma, P. N., Sarstedt, M., Ringle, C. M., & Liengaard, B. D. (2024). The shortcomings of equal weights estimation and the composite equivalence index in PLS-SEM. European Journal of Marketing, 58(13), 30–55.
Hancock, G. R., & Mueller, R. O. (2001). Rethinking construct reliability within latent variable systems. In R. Cudeck, S. du Toit, & D. Sorbom (Eds.), Structural equation modeling: Present and future—A festschrift in honor of Karl Joreskog (pp. 195–216). Scientific Software International.
Hemker, B. T., Sijtsma, K., Molenaar, I. W., & Junker, B. W. (1996). Polytomous IRT models and monotone likelihood ratio of the total score. Psychometrika, 61(4), 679–693.
Hemker, B. T., Sijtsma, K., Molenaar, I. W., & Junker, B. W. (1997). Stochastic ordering using the latent trait and the sum score in polytomous IRT models. Psychometrika, 62(3), 331–347.
Higgins, W. C., Kaplan, D. M., Deschrijver, E., & Ross, R. M. (2023). Construct validity evidence reporting practices for the Reading the Mind in the Eyes Test: A systematic scoping review. Clinical Psychology Review, 108.
Hogan, T. P., & Agnello, J. (2004). An empirical study of reporting practices concerning measurement validity. Educational and Psychological Measurement, 64(5), 802–812.
Hopwood, C. J., & Donnellan, M. B. (2010). How should the internal structure of personality inventories be evaluated? Personality and Social Psychology Review, 14(3), 332–346.
Howard, A. L. (2024). Graduate students need more quantitative methods support. Nature Reviews Psychology, 3, 140–141.
Hsiao, Y. Y., Kwok, O. M., & Lai, M. H. (2018). Evaluation of two methods for modeling measurement errors when testing interaction effects with observed composite scores. Educational and Psychological Measurement, 78(2), 181–202.
Huang, P. H. (2022). Penalized least squares for structural equation modeling with ordinal responses. Multivariate Behavioral Research, 57(2–3), 279–297.
Hubley, A. M., Zhu, S. M., Sasaki, A., & Gadermann, A. M. (2014). Synthesis of validation practices in two assessment journals: Psychological Assessment and the European Journal of Psychological Assessment. In B. D. Zumbo & E. K. Chan (Eds.), Validity and validation in social, behavioral, and health sciences (pp. 193–213). Springer.
Hussong, A. M., Gottfredson, N. C., Bauer, D. J., Curran, P. J., Haroon, M., Chandler, R., & Springer, S. A. (2019). Approaches for creating comparable measures of alcohol use symptoms: Harmonization with eight studies of criminal justice populations. Drug and Alcohol Dependence, 194, 59–68.
Hwang, H., Cho, G., Jung, K., Falk, C. F., Flake, J. K., Jin, M. J., & Lee, S. H. (2021). An approach to structural equation modeling with both factors and components: Integrated generalized structured component analysis. Psychological Methods, 26(3), 273–294.
Jackson, D. L., Gillaspy, J. A., Jr., & Purc-Stephenson, R. (2009). Reporting practices in confirmatory factor analysis: An overview and some recommendations. Psychological Methods, 14(1), 6–23.
Jacobucci, R., & Grimm, K. J. (2020). Machine learning and psychological research: The unexplored effect of measurement. Perspectives on Psychological Science, 15(3), 809–816.
Jacobucci, R., Grimm, K. J., & McArdle, J. J. (2016). Regularized structural equation modeling. Structural Equation Modeling, 23(4), 555–566.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). American Council on Education/Praeger.
Kang, S. M., & Waller, N. G. (2005). Moderated multiple regression, spurious interaction effects, and IRT. Applied Psychological Measurement, 29(2), 87–105.
Kessels, R., Moerbeek, M., Bloemers, J., & van Der Heijden, P. G. (2021). A multilevel structural equation model for assessing a drug effect on a patient-reported outcome measure in on-demand medication data. Biometrical Journal, 63(8), 1652–1672.
Kim, S. (2006). A comparative study of IRT fixed parameter calibration methods. Journal of Educational Measurement, 43(4), 355–381.
König, C., Khorramdel, L., Yamamoto, K., & Frey, A. (2021). The benefits of fixed item parameter calibration for parameter accuracy in small sample situations in large-scale assessments. Educational Measurement: Issues and Practice, 40(1), 17–27.
Kuhfeld, M., & Soland, J. (2022). Avoiding bias from sum scores in growth estimates: An examination of IRT-based approaches to scoring longitudinal survey responses. Psychological Methods, 27(2), 234–260.
Kuhfeld, M., & Soland, J. (2023). Scoring assessments in multisite randomized control trials: Examining the sensitivity of treatment effect estimates to measurement choices. Psychological Methods. Advance online publication.
Li, H., Rosenthal, R., & Rubin, D. B. (1996). Reliability of measurement in psychology: From Spearman–Brown to maximal reliability. Psychological Methods, 1(1), 98–107.
Li, X., & Jacobucci, R. (2022). Regularized structural equation modeling with stability selection. Psychological Methods, 27(4), 497–518.
Liang, X., & Jacobucci, R. (2020). Regularized structural equation modeling to detect measurement bias: Evaluation of lasso, adaptive lasso, and elastic net. Structural Equation Modeling, 27(5), 722–734.
Liu, Q., & Wang, L. (2021). t-Test and ANOVA for data with ceiling and/or floor effects. Behavior Research Methods, 53(1), 264–277.
Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584–585.
Luningham, J. M., McArtor, D. B., Bartels, M., Boomsma, D. I., & Lubke, G. H. (2017). Sum scores in twin growth curve models: Practicality versus bias. Behavior Genetics, 47, 516–536.
Maassen, E., D’Urso, E. D., van Assen, M. A., Nuijten, M. B., De Roover, K., & Wicherts, J. M. (2024). The dire disregard of measurement invariance testing in psychological science. Psychological Methods. Advance online publication.
Maxwell, S. E., & Delaney, H. D. (1985). Measurement and statistics: An examination of construct validity. Psychological Bulletin, 97(1), 85–93.
McClure, K., Ammerman, B. A., & Jacobucci, R. (2024). On the selection of item scores or composite scores for clinical prediction. Multivariate Behavioral Research, 59(3), 566–583.
McNeish, D., & Wolf, M. G. (2020). Thinking twice about sum scores. Behavior Research Methods, 52, 2287–2305.
McNeish, D. (2023). Psychometric properties of sum scores and factor scores differ even when their correlation is 0.98: A response to Widaman and Revelle. Behavior Research Methods, 55(8), 4269–4290.
McNeish, D. (2023). Generalizability of dynamic fit index, equivalence testing, and Hu & Bentler cutoffs for evaluating fit in factor analysis. Multivariate Behavioral Research, 58(1), 195–219.
McNeish, D. (2023). Dynamic fit index cutoffs for categorical factor analysis with Likert-type, ordinal, or binary responses. American Psychologist, 78(9), 1061–1075.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2002). On the role of task model variables in assessment design. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 97–128). Lawrence Erlbaum.
Mislevy, R. J., Almond, R. G., & Lukas, J. F. (2003). A brief introduction to evidence-centered design. ETS Research Report Series, 2003(1), i–29.
Morgan-López, A. A., Saavedra, L. M., Hien, D. A., Norman, S. B., Fitzpatrick, S. S., Ye, A., & Back, S. E. (2023). Differential symptom weighting in estimating empirical thresholds for underlying PTSD severity: Toward a “platinum” standard for diagnosis? International Journal of Methods in Psychiatric Research, 32(3).
Morse, B. J., Johanson, G. A., & Griffeth, R. W. (2012). Using the graded response model to control spurious interactions in moderated multiple regression. Applied Psychological Measurement, 36(2), 122–146.
Müller, S., Hopwood, C. J., Skodol, A. E., Morey, L. C., Oltmanns, T. F., Benecke, C., & Zimmermann, J. (2023). Exploring the predictive validity of personality disorder criteria. Personality Disorders: Theory, Research, and Treatment, 14(3), 309–320.
Murray, A. L., Molenaar, D., Johnson, W., & Krueger, R. F. (2016). Dependence of gene-by-environment interactions (GxE) on scaling: Comparing the use of sum scores, transformed sum scores and IRT scores for the phenotype in tests of GxE. Behavior Genetics, 46, 552–572.
Nosek, B. A., Hardwicke, T. E., Moshontz, H., Allard, A., Corker, K. S., Dreber, A., & Vazire, S. (2022). Replicability, robustness, and reproducibility in psychological science. Annual Review of Psychology, 73, 719–748.
Padilla García, J. L., & Benítez Baena, I. (2014). Validity evidence based on response processes. Psicothema, 26(1), 136–144.
Pashler, H., & Wagenmakers, E. J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7(6), 528–530.
Pelt, D. H., Schwabe, I., & Bartels, M. (2023). Bias in gene-by-environment interaction effects with sum scores: An application to well-being phenotypes. Behavior Genetics, 53, 359–373.
Peters, G. J., & Crutzen, R. (2024). Knowing what we’re talking about: Facilitating decentralized, unequivocal publication of and reference to psychological construct definitions and instructions. Meta-Psychology, 8, 1–27.
Proust-Lima, C., Philipps, V., Dartigues, J. F., Bennett, D. A., Glymour, M. M., Jacqmin-Gadda, H., & Samieri, C. (2019). Are latent variable models preferable to composite score approaches when assessing risk factors of change? Evaluation of type-I error and statistical power in longitudinal cognitive studies. Statistical Methods in Medical Research, 28(7), 1942–1957.
Proust-Lima, C., Dartigues, J. F., & Jacqmin-Gadda, H. (2011). Misuse of the linear mixed model when evaluating risk factors of cognitive decline. American Journal of Epidemiology, 174(9), 1077–1088.
Pruzek, R. M., & Frederick, B. C. (1978). Weighting predictors in linear models: Alternatives to least squares and limitations of equal weights. Psychological Bulletin, 85(2), 254–266.
Qualls, A. L., & Moss, A. D. (1996). The degree of congruence between test standards and test documentation within journal publications. Educational and Psychological Measurement, 56(2), 209–214.
Ramsay, J. O., & Wiberg, M. (2017). A strategy for replacing sum scoring. Journal of Educational and Behavioral Statistics, 42(3), 282–307.
Reise, S. P., & Henson, J. M. (2003). A discussion of modern versus traditional psychometrics as applied to personality assessment scales. Journal of Personality Assessment, 81(2), 93–103.
Reise, S. P., & Waller, N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5, 27–48.
Revelle, W. (2024). The seductive beauty of latent variable models: Or why I don’t believe in the Easter Bunny. Personality and Individual Differences, 221.
Rhemtulla, M., van Bork, R., & Borsboom, D. (2020). Worse than measurement error: Consequences of inappropriate latent variable measurement models. Psychological Methods, 25(1), 30–45.
Rodgers, J. L., & Shrout, P. E. (2018). Psychology’s replication crisis as scientific opportunity: A précis for policymakers. Policy Insights from the Behavioral and Brain Sciences, 5(1), 134–141.
Rose, N., Wagner, W., Mayer, A., & Nagengast, B. (2019). Model-based manifest and latent composite scores in structural equation models. Collabra: Psychology, 5(1), 9.
Russell, D. W. (2002). In search of underlying dimensions: The use (and abuse) of factor analysis in Personality and Social Psychology Bulletin. Personality and Social Psychology Bulletin, 28(12), 1629–1646.
Schwabe, I., & van den Berg, S. M. (2014). Assessing genotype by environment interaction in case of heterogeneous measurement error. Behavior Genetics, 44(4), 394–406.
Schimmack, U. (2021). The validation crisis in psychology. Meta-Psychology, 5, 1–9.
Shaw, M., Cloos, L. J., Luong, R., Elbaz, S., & Flake, J. K. (2020). Measurement practices in large-scale replications: Insights from Many Labs 2. Canadian Psychology/Psychologie Canadienne, 61(4), 289.
Schreiber, J. B. (2021). Issues and recommendations for exploratory factor analysis and principal component analysis. Research in Social and Administrative Pharmacy, 17(5), 1004–1011.
Shear, B. R., & Zumbo, B. D. (2014). What counts as evidence: A review of validity studies in educational and psychological measurement. In B. D. Zumbo & E. K. Chan (Eds.), Validity and validation in social, behavioral, and health sciences (pp. 91–111). Springer.
Sijtsma, K. (2012). Future of psychometrics: Ask what psychometrics can do for psychology. Psychometrika, 77, 4–20.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika, 74, 107–120.
Sijtsma, K., Ellis, J. L., & Borsboom, D. (2024). Recognize the value of the sum score, psychometrics’ greatest accomplishment. Psychometrika, 89(1), 84–117.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
Sireci, S., & Faulkner-Bond, M. (2014). Validity evidence based on test content. Psicothema, 26(1), 100–107.
Sireci, S. G. (1998). The construct of content validity. Social Indicators Research, 45, 83–117.
Sireci, S. G. (1998). Gathering and analyzing content validity data. Educational Assessment, 5(4), 299–321.
Slof-Op’t Landt, M. C. T., van Furth, E. F., Rebollo-Mesa, I., Bartels, M., van Beijsterveldt, C. E. M., Slagboom, P. E., & Dolan, C. V. (2009). Sex differences in sum scores may be hard to interpret: The importance of measurement invariance. Assessment, 16(4), 415–423.
Soland, J. (2022). Evidence that selecting an appropriate item response theory-based approach to scoring surveys can help avoid biased treatment effect estimates. Educational and Psychological Measurement, 82(2), 376–403.
Soland, J., Kuhfeld, M., & Edwards, K. (2022a). How survey scoring decisions can influence your study’s results: A trip through the IRT looking glass. Psychological Methods. Advance online publication.
Soland, J., McGinty, A., Gray, A., Solari, E. J., Herring, W., & Xu, R. (2022). Early literacy, equity, and test score comparability during the pandemic. Educational Assessment, 27(2), 98–114.
Soland, J., Johnson, A., & Talbert, E. (2023). Regression discontinuity designs in a latent variable framework. Psychological Methods, 28(3), 691–704.
Soland, J., Cole, V., Tavares, S., & Zhang, Q. (2024). Evidence that growth mixture model results are highly sensitive to scoring decisions. PsyArXiv. https://osf.io/preprints/psyarxiv/d27rc
Speelman, C. P., Parker, L., Rapley, B. J., & McGann, M. (2024). Most psychological researchers assume their samples are ergodic: Evidence from a year of articles in three major journals. Collabra: Psychology, 10(1), 92888.
Stochl, J., Fried, E. I., Fritz, J., Croudace, T. J., Russo, D. A., Knight, C., & Perez, J. (2022). On dimensionality, measurement invariance, and suitability of sum scores for the PHQ-9 and the GAD-7. Assessment, 29(3), 355–366.
Tackett, J. L., Brandes, C. M., King, K. M., & Markon, K. E. (2019). Psychology’s replication crisis and clinical psychological science. Annual Review of Clinical Psychology, 15, 579–604.
Tang, X., Schalet, B. D., Peipert, J. D., & Cella, D. (2023). Does scoring method impact estimation of significant individual changes assessed by patient-reported outcome measures? Comparing classical test theory versus item response theory. Value in Health, 23(10), 1518–1524.
Tay, L., Woo, S. E., Hickman, L., & Saef, R. M. (2020). Psychometric and validity issues in machine learning approaches to personality assessment: A focus on social media text mining. European Journal of Personality, 34(5), 826–844.
Thissen, D., Steinberg, L., Pyszczynski, T., & Greenberg, J. (1983). An item response theory for personality and attitude scales: Item analysis using restricted factor analysis. Applied Psychological Measurement, 7(2), 211–226.
Tsutakawa, R. K., & Johnson, J. C. (1990). The effect of uncertainty of item parameter estimation on ability estimates. Psychometrika, 55(2), 371–390.
van den Oord, E. J., Pickles, A., & Waldman, I. D. (2003). Normal variation and abnormality: An empirical study of the liability distributions underlying depression and delinquency. Journal of Child Psychology and Psychiatry, 44(2), 180–192.
van den Oord, E. J., & van der Ark, L. A. (1997). A note on the use of the Tobit approach for tests scores with floor or ceiling effects. British Journal of Mathematical and Statistical Psychology, 50(2), 351–364.
van der Ark, L. A. (2005). Stochastic ordering of the latent trait by the sum score under various polytomous IRT models. Psychometrika, 70, 283–304.
Vogelsmeier, L. V., Jongerling, J., & Maassen, E. (2024). Assessing and accounting for measurement in intensive longitudinal studies: Current practices, considerations, and avenues for improvement. Quality of Life Research. Advance online publication.
Vogelsmeier, L. V., Vermunt, J. K., Keijsers, L., & De Roover, K. (2021). Latent Markov latent trait analysis for exploring measurement model changes in intensive longitudinal data. Evaluation & the Health Professions, 44(1), 61–76.
Vogelsmeier, L. V., Vermunt, J. K., van Roekel, E., & De Roover, K. (2019). Latent Markov factor analysis for exploring measurement model changes in time-intensive longitudinal studies. Structural Equation Modeling, 26(4), 557–575.
Wainer, H. (1976). Estimating coefficients in linear models: It don’t make no nevermind. Psychological Bulletin, 83(2), 213–217.
Weidman, A. C., Steckler, C. M., & Tracy, J. L. (2017). The jingle and jangle of emotion assessment: Imprecise measurement, casual scale usage, and conceptual fuzziness in emotion research. Emotion, 17(2), 267–295.
Wicherts, J. M., Veldkamp, C. L., Augusteijn, H. E., Bakker, M., van Aert, R., & van Assen, M. A. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7, 1832.
Wilson, M., Allen, D. D., & Li, J. C. (2006). Improving measurement in health education and health behavior research using item response modeling: Comparison with the classical test theory approach. Health Education Research, 21(Suppl. 1), i19–i32.
Wolf, M. G. (2023). The problem with over-relying on quantitative evidence of validity. PsyArXiv. https://doi.org/10.31234/osf.io/v4nb2
Zwitser, R. J., & Maris, G. (2016). Ordering individuals with sum scores: The introduction of the nonparametric Rasch model. Psychometrika, 81, 39–59.