Adjusted Residuals for Evaluating Conditional Independence in IRT Models for Multistage Adaptive Testing

Published online by Cambridge University Press: 01 January 2025

Peter W. van Rijn*
Affiliation:
ETS Global
Usama S. Ali
Affiliation:
Educational Testing Service; South Valley University
Hyo Jeong Shin
Affiliation:
Sogang University, Seoul
Sean-Hwane Joo
Affiliation:
University of Kansas
* Correspondence should be made to Peter W. van Rijn, ETS Global, Amsterdam, The Netherlands. Email: pvanrijn@etsglobal.org

Abstract

The key assumption of conditional independence of item responses given latent ability in item response theory (IRT) models is addressed for multistage adaptive testing (MST) designs. Routing decisions in MST designs can cause patterns in the data that are not accounted for by the IRT model. This phenomenon relates to quasi-independence in log-linear models for incomplete contingency tables and impacts certain types of statistical inference based on assumptions about observed and missing data. We demonstrate that generalized residuals for item pair frequencies under IRT models as discussed by Haberman and Sinharay (J Am Stat Assoc 108:1435–1444, 2013. https://doi.org/10.1080/01621459.2013.835660) are inappropriate for MST data without adjustments. The adjustments depend on the MST design and can quickly become nontrivial as the complexity of the routing increases. However, the adjusted residuals are found to have satisfactory Type I error rates in a simulation and are illustrated with an application to real MST data from the Programme for International Student Assessment (PISA). Implications and suggestions for statistical inference with MST designs are discussed.

Information

Type
Theory and Methods
Copyright
Copyright © 2023 The Author(s) under exclusive licence to The Psychometric Society.

References

Ali, U. S., Shin, H. J., & van Rijn, P. W. (in press). Applicability of traditional statistical methods to multistage test data. In D. Yan & A. von Davier (Eds.), Research for practical issues and solutions in computerized multistage testing. Taylor and Francis.
Berger, M. P. (1992). Sequential sampling designs for the two-parameter item response theory model. Psychometrika, 57, 521–538.
Bishop, Y. M., Fienberg, S. E., & Holland, P. W. (2007). Discrete multivariate analysis: Theory and practice. Springer.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
Cai, L., & Hansen, M. (2013). Limited-information goodness-of-fit testing of hierarchical item factor models. British Journal of Mathematical and Statistical Psychology, 66, 245–276.
Chalmers, R. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29.
Chen, W.-H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265–289.
Christensen, K. B., Makransky, G., & Horton, M. (2017). Critical values for Yen’s Q3: Identification of local dependence in the Rasch model using residual correlations. Applied Psychological Measurement, 41(3), 178–194.
Eggen, T. J. H. M., & Verhelst, N. D. (2011). Item calibration in incomplete testing designs. Psicológica, 32(1), 107–132.
Gibbons, R. D., & Hedeker, D. (1992). Full-information item bi-factor analysis. Psychometrika, 57, 423–436.
Glas, C. A. W. (1988). The Rasch model and multistage testing. Journal of Educational Statistics, 13, 45–52.
Glas, C. A. W. (1989). Contributions to estimating and testing Rasch models (Unpublished doctoral dissertation). University of Twente.
Goodman, L. A. (1968). The analysis of cross-classified data: Independence, quasi-independence, and interactions in contingency tables with or without missing entries. Journal of the American Statistical Association, 63, 1091–1131.
Haberman, S. J. (2007). The interaction model. In M. von Davier & C. H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models (pp. 201–216). Springer.
Haberman, S. J. (2013). A general program for item-response analysis that employs the stabilized Newton–Raphson algorithm (ETS Research Report RR-13-32). https://doi.org/10.1002/j.2333-8504.2013.tb02339.x
Haberman, S. J., & Sinharay, S. (2013). Generalized residuals for general models for contingency tables with application to item response theory. Journal of the American Statistical Association, 108, 1435–1444.
Haberman, S. J., Sinharay, S., & Chon, K. H. (2013). Assessing item fit for unidimensional item response theory models using residuals from estimated item response functions. Psychometrika, 78, 417–440.
Haberman, S. J., & von Davier, A. A. (2014). Considerations on parameter estimation, scoring, and linking in multistage testing. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 229–248). CRC Press.
Houts, C. R., & Cai, L. (2016). flexMIRT: User manual version 3.5: Flexible multilevel multidimensional item analysis and test scoring. Vector Psychometric Group.
Ip, E. H. (2002). Locally dependent latent trait model and the Dutch identity revisited. Psychometrika, 67, 367–386.
Jewsbury, P. A., & van Rijn, P. W. (2020). IRT and MIRT models for item parameter estimation with multidimensional multistage tests. Journal of Educational and Behavioral Statistics, 45, 383–402.
Joe, H., & Maydeu-Olivares, A. (2010). A general family of limited information goodness-of-fit statistics for multinomial data. Psychometrika, 75, 393–419.
Johnson, E. G. (1992). The design of the national assessment of educational progress. Journal of Educational Measurement, 29(2), 95–110.
Kelderman, H., & Rijkes, C. P. M. (1994). Loglinear multidimensional IRT models for polytomously scored items. Psychometrika, 59(2), 149–176.
Kolen, M., & Brennan, R. (2004). Test equating, scaling, and linking: Methods and practices. Springer.
Liu, Y., & Maydeu-Olivares, A. (2013). Local dependence diagnostics in IRT modeling of binary data. Educational and Psychological Measurement, 73, 254–274.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.
Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings”. Applied Psychological Measurement, 8, 453–461.
Louis, T. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society Series B (Methodological), 44, 226–233.
Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in 2^n contingency tables: A unified framework. Journal of the American Statistical Association, 100, 1009–1020.
McDonald, R. P. (1999). Test theory: A unified treatment. Lawrence Erlbaum.
Messick, S., Beaton, A., & Lord, F. (1983). National assessment of educational progress reconsidered: A new design for a new era (Tech. Rep.).
Mislevy, R. J., & Chang, H. H. (2000). Does adaptive testing violate local independence? Psychometrika, 65(2), 149–156.
Mislevy, R. J., & Wu, P.-K. (1996). Missing responses and IRT ability estimation: Omits, choice, time limits, and adaptive testing (ETS Research Report RR-96-30). Educational Testing Service.
Monseur, C., Baye, A., Lafontaine, D., & Quittre, V. (2011). PISA test format assessment and the local independence assumption. IERI Monographs Series: Issues and Methodologies in Large-Scale Assessments, 4.
Naylor, J. C., & Smith, A. F. M. (1982). Applications of a method for efficient computation of posterior distributions. Applied Statistics, 31, 214–225.
Nikoloulopoulos, A. K., & Joe, H. (2015). Factor copula models for item response data. Psychometrika, 80(1), 126–150.
Pommerich, M., & Segall, D. O. (2008). Local dependence in an operational CAT: Diagnosis and implications. Journal of Educational Measurement, 45(3), 201–223.
R Core Team. (2019). R: A language and environment for statistical computing [Computer software manual]. Retrieved from https://www.R-project.org/
Reckase, M. D. (2009). Multidimensional item response theory. Springer.
Reiser, M. (1996). Analysis of residuals for the multinomial item response model. Psychometrika, 61, 509–528.
Robin, F., Steffen, M., & Liang, L. (2014). The multistage test implementation of the GRE revised general test. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 325–341). CRC Press.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Tjur, T. (1982). A connection between Rasch’s item analysis model and a multiplicative Poisson model. Scandinavian Journal of Statistics, 9, 23–30.
van Rijn, P. W., Sinharay, S., Haberman, S. J., & Johnson, M. S. (2016). Assessment of fit of item response theory models used in large-scale educational survey assessments. Large-scale Assessments in Education.
Verhelst, N. D., & Verstralen, H. H. F. M. (2008). Some considerations on the partial credit model. Psicológica, 29, 229–254.
von Davier, M., Yamamoto, K., Shin, H. J., Chen, H., Khorramdel, L., Weeks, J., & Kandathil, M. (2019). Evaluating item response theory linking and model fit for data from PISA 2000–2012. Assessment in Education: Principles, Policy & Practice, 26(4), 466–488.
Wainer, H., Bradlow, E., & Wang, X. (2007). Testlet response theory and its applications. Cambridge University Press.
Wainer, H., & Thissen, D. (1996). How is reliability related to the quality of test scores? What is the effect of local dependence on reliability? Educational Measurement: Issues and Practice, 15(1), 22–29.
Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450.
Woods, C. M. (2015). Estimating the latent density in unidimensional IRT to permit nonnormality. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 60–84). Routledge.
Yamamoto, K., Shin, H. J., & Khorramdel, L. (2019). Introduction of multistage adaptive testing design in PISA 2018 (OECD Education Working Papers No. 209). https://doi.org/10.1787/b9435d4b-en
Yen, W. M. (1984). Effect of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8(2), 125–145.
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213.
Zenisky, A. L., Hambleton, R. K., & Sireci, S. G. (2001). Effects of local item dependence on the validity of IRT item, test, and ability statistics (MCAT-5). https://doi.org/10.1002/j.2333-8504.2006.tb02009.x
Zhang, J. (2013). A procedure for dimensionality analyses of response data from various test designs. Psychometrika, 78(1), 37–58.
Zwitser, R. J., & Maris, G. (2015). Conditional statistical inference with multistage testing designs. Psychometrika, 80(1), 65–84.