
Bootstrap-Calibrated Interval Estimates for Latent Variable Scores in Item Response Theory

Published online by Cambridge University Press: 01 January 2025

Yang Liu*
Affiliation:
Department of Human Development and Quantitative Methodology, University of Maryland
Ji Seung Yang
Affiliation:
Department of Human Development and Quantitative Methodology, University of Maryland
* Correspondence should be made to Yang Liu, Department of Human Development and Quantitative Methodology, University of Maryland, 1230B Benjamin Building, College Park, MD 20742 USA. Email: yliu87@umd.edu

Abstract

In most item response theory applications, model parameters need to be first calibrated from sample data. Latent variable (LV) scores calculated using estimated parameters are thus subject to sampling error inherited from the calibration stage. In this article, we propose a resampling-based method, namely bootstrap calibration (BC), to reduce the impact of the carryover sampling error on the interval estimates of LV scores. BC modifies the quantile of the plug-in posterior, i.e., the posterior distribution of the LV evaluated at the estimated model parameters, to better match the corresponding quantile of the true posterior, i.e., the posterior distribution evaluated at the true model parameters, over repeated sampling of calibration data. Furthermore, to achieve better coverage of the fixed true LV score, we explore the use of BC in conjunction with Jeffreys’ prior. We investigate the finite-sample performance of BC via Monte Carlo simulations and apply it to two empirical data examples.
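The calibration idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a Rasch model with a standard normal LV, a posterior evaluated on a grid, a simple moment estimator in place of marginal maximum likelihood, and a parametric bootstrap in which the estimated item parameters play the role of the true ones. All function names (`calibrate`, `posterior_cdf`, `bc_level`) and all settings (8 items, 300 calibration respondents, 200 bootstrap replications) are hypothetical choices made for illustration.

```python
import numpy as np

GRID = np.linspace(-6.0, 6.0, 241)      # quadrature grid for the LV (assumed)
PRIOR = np.exp(-0.5 * GRID**2)
PRIOR /= PRIOR.sum()                    # discretized N(0, 1) prior weights


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def simulate(b, n, rng):
    """Draw n Rasch response patterns with difficulties b and N(0, 1) LVs."""
    theta = rng.standard_normal(n)
    p = expit(theta[:, None] - b[None, :])
    return (rng.random(p.shape) < p).astype(int)


def calibrate(data):
    """Moment estimator (a stand-in for marginal ML): choose each difficulty
    so the model-implied proportion correct matches the observed one."""
    p_hat = np.clip(data.mean(axis=0), 0.01, 0.99)
    lo = np.full(p_hat.shape, -8.0)
    hi = np.full(p_hat.shape, 8.0)
    for _ in range(50):                 # vectorized bisection over items
        mid = 0.5 * (lo + hi)
        marg = PRIOR @ expit(GRID[:, None] - mid[None, :])
        raise_b = marg > p_hat          # implied proportion too high: item is harder
        lo = np.where(raise_b, mid, lo)
        hi = np.where(raise_b, hi, mid)
    return 0.5 * (lo + hi)


def posterior_cdf(y, b):
    """Posterior CDF of the LV on GRID for response pattern y at difficulties b."""
    p = expit(GRID[:, None] - b[None, :])
    post = np.prod(np.where(y == 1, p, 1.0 - p), axis=1) * PRIOR
    return np.cumsum(post) / post.sum()


def quantile(cdf, level):
    """Grid-based quantile by linear interpolation of the CDF."""
    return np.interp(level, cdf, GRID)


def bc_level(y, b_hat, n, level, n_boot, rng):
    """Adjusted level whose plug-in quantile has bootstrap-estimated coverage `level`."""
    cdf_hat = posterior_cdf(y, b_hat)   # treated as the true posterior in the bootstrap world
    levels = np.linspace(0.001, 0.999, 999)
    cover = np.zeros_like(levels)
    for _ in range(n_boot):
        b_star = calibrate(simulate(b_hat, n, rng))       # re-calibrate on bootstrap data
        q_star = quantile(posterior_cdf(y, b_star), levels)
        cover += np.interp(q_star, GRID, cdf_hat)         # probability mass below each quantile
    cover /= n_boot
    return float(np.interp(level, cover, levels))         # invert the estimated coverage curve


rng = np.random.default_rng(1)
b_true = np.linspace(-1.5, 1.5, 8)      # hypothetical true difficulties
n = 300                                 # hypothetical calibration sample size
b_hat = calibrate(simulate(b_true, n, rng))
y = np.array([1, 1, 1, 0, 1, 0, 0, 0])  # response pattern to be scored

cdf_hat = posterior_cdf(y, b_hat)
lo_adj = bc_level(y, b_hat, n, 0.025, 200, rng)
hi_adj = bc_level(y, b_hat, n, 0.975, 200, rng)
plug_in = (quantile(cdf_hat, 0.025), quantile(cdf_hat, 0.975))
calibrated = (quantile(cdf_hat, lo_adj), quantile(cdf_hat, hi_adj))
print("plug-in interval:   ", plug_in)
print("calibrated interval:", calibrated)
```

In this sketch the interval endpoints are still quantiles of the plug-in posterior. Only the nominal levels are adjusted, so the extra cost lies entirely in re-calibrating the item parameters on each bootstrap sample.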

Type
Original Paper
Copyright
Copyright © 2017 The Psychometric Society

