When multiple items are clustered around a reading passage, the local independence assumption in item response theory is often violated. The amount of information contained in an item cluster is usually overestimated if the violation of local independence is ignored and items are treated as locally independent when in fact they are not. In this article we provide a general method that adjusts for the inflation of information associated with a test containing item clusters. We present a computational scheme for evaluating the adjustment factor, first in the restrictive case of two items per cluster and then in the general case of more than two items per cluster. The methodology was motivated by a study of the NAEP Reading Assessment. We present a simulation study along with an analysis of a NAEP data set.
In this note, we prove that the three-parameter logistic model with fixed-effect abilities is identified only up to a linear transformation of the ability scale under mild regularity conditions, contrary to the claims in Theorem 2 of San Martín et al. (Psychometrika, 80(2):450–467, 2015a).
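As a brief illustration of what identification up to a linear transformation means here (a sketch under the standard 3PL parameterization): rescaling and shifting the ability axis can be absorbed into the item parameters without changing any response probability,
\[
P(X_{ij}=1\mid\theta_i)=c_j+(1-c_j)\,\frac{1}{1+\exp\{-a_j(\theta_i-b_j)\}},
\]
since for any \(\alpha>0\) and \(\beta\), setting \(\theta_i^{*}=\alpha\theta_i+\beta\), \(a_j^{*}=a_j/\alpha\), \(b_j^{*}=\alpha b_j+\beta\), and \(c_j^{*}=c_j\) leaves every probability unchanged.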
This study proposes a new item parameter linking method for the common-item nonequivalent groups design in item response theory (IRT). Previous studies assumed that examinees are randomly assigned to either test form. In practice, however, examinees can frequently select their own test forms, and that selection often depends on their abilities. In such cases, concurrent calibration or multiple-group IRT modeling that does not model test form selection behavior can yield severely biased results. We propose a model wherein test form selection behavior depends on test scores and use a Monte Carlo expectation maximization (MCEM) algorithm for estimation. This method provided adequate parameter estimates.
The relationship between the EM algorithm and the Bock-Aitkin procedure is described with a continuous distribution of ability (latent trait) from an EM-algorithm perspective. Previous work has been restricted to the discrete case from a probit-analysis perspective.
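For orientation, a compact statement of the marginalization the Bock-Aitkin procedure approximates (a sketch in standard notation, not a derivation from this paper): the integral over the continuous ability distribution \(g(\theta)\) is replaced by a quadrature sum,
\[
L_i=\int \prod_j P_j(\theta)^{x_{ij}}\,\{1-P_j(\theta)\}^{1-x_{ij}}\,g(\theta)\,d\theta
\;\approx\;\sum_{q=1}^{Q}\prod_j P_j(\theta_q)^{x_{ij}}\,\{1-P_j(\theta_q)\}^{1-x_{ij}}\,A(\theta_q),
\]
where \(\theta_q\) are quadrature nodes and \(A(\theta_q)\) their weights; the EM E-step evaluates each examinee's posterior weight at every node, and the M-step updates the item parameters from the resulting expected counts.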
When latent variables are used as outcomes in regression analysis, a common approach to the otherwise ignored measurement error problem is to take a multilevel perspective on item response theory (IRT) modeling. Although recent computational advances allow efficient and accurate estimation of multilevel IRT models, we argue that a two-stage divide-and-conquer strategy still has unique advantages. Within the two-stage framework, three methods that take into account heteroscedastic measurement errors of the dependent variable in the stage II analysis are introduced: the closed-form marginal MLE, the expectation maximization algorithm, and the moment estimation method. They are compared to the naïve two-stage estimation and the one-stage MCMC estimation. A simulation study is conducted to compare the five methods in terms of model parameter recovery and standard error estimation. The pros and cons of each method are also discussed to provide guidelines for practitioners. Finally, a real data example based on the National Education Longitudinal Study of 1988 (NELS:88) illustrates the application of the various methods.
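As a minimal sketch of the kind of stage-II marginal likelihood such two-stage methods target (my notation and assumptions, not necessarily the authors'): if the stage-I ability estimate satisfies \(\hat\theta_i=\theta_i+e_i\) with known, person-specific error variance \(\sigma_i^2\), and \(\theta_i=\mathbf{x}_i^\top\boldsymbol\beta+\varepsilon_i\) with \(\varepsilon_i\sim N(0,\sigma^2)\), then marginally \(\hat\theta_i\sim N(\mathbf{x}_i^\top\boldsymbol\beta,\ \sigma^2+\sigma_i^2)\), and the marginal log-likelihood
\[
\ell(\boldsymbol\beta,\sigma^2)=-\tfrac12\sum_i\Bigl[\log(\sigma^2+\sigma_i^2)+\frac{(\hat\theta_i-\mathbf{x}_i^\top\boldsymbol\beta)^2}{\sigma^2+\sigma_i^2}\Bigr]+\text{const}
\]
can be maximized directly, in contrast to naïve two-stage estimation, which ignores the \(\sigma_i^2\).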
This paper presents a machine learning approach to multidimensional item response theory (MIRT), a class of latent factor models that can be used to model and predict student performance from observed assessment data. Inspired by collaborative filtering, we define a general class of models that includes many MIRT models. We discuss the use of penalized joint maximum likelihood to estimate individual models and cross-validation to select the best performing model. This model evaluation process can be optimized using batching techniques, such that even sparse large-scale data can be analyzed efficiently. We illustrate our approach with simulated and real data, including an example from a massive open online course. The high-dimensional model fit to this large and sparse dataset does not lend itself well to traditional methods of factor interpretation. By analogy to recommender-system applications, we propose an alternative “validation” of the factor model, using auxiliary information about the popularity of items consulted during an open-book examination in the course.
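The following sketch illustrates the collaborative-filtering flavor of penalized joint maximum likelihood for a compensatory MIRT-like model; the function, penalty, and tuning values are illustrative assumptions, not the authors' implementation.

```python
# Illustrative penalized joint maximum likelihood for a K-dimensional
# compensatory MIRT-like model: P(X_ij = 1) = sigmoid(theta_i . a_j + b_j),
# fit by penalized gradient ascent on the observed entries only.
import numpy as np

def fit_pjml(X, observed, K=3, lam=0.1, lr=0.05, n_iter=500, seed=0):
    """X: 0/1 response matrix (persons x items); observed: 1 where a response exists."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    theta = rng.normal(scale=0.1, size=(n, K))   # person factor scores
    a = rng.normal(scale=0.1, size=(m, K))       # item discriminations (loadings)
    b = np.zeros(m)                              # item intercepts
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta @ a.T + b)))
        resid = observed * (X - p)               # gradient of the Bernoulli log-likelihood
        theta += lr * (resid @ a - lam * theta)  # L2-penalized updates
        a += lr * (resid.T @ theta - lam * a)
        b += lr * resid.sum(axis=0)
    return theta, a, b
```

Held-out cells (observed == 0) can then be predicted from the fitted parameters, and cross-validated predictive log-likelihood used to choose K and lam, in the spirit of the model-selection step described above.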
This paper presents a noncompensatory latent trait model for cognitive diagnosis, the multicomponent latent trait model for diagnosis (MLTM-D). In MLTM-D, a hierarchical relationship between components and attributes is specified, permitting diagnosis at two levels. MLTM-D generalizes the multicomponent latent trait model (MLTM; Whitely in Psychometrika, 45:479–494, 1980; Embretson in Psychometrika, 49:175–186, 1984) to measures of broad traits, such as achievement tests, in which component structure varies between items. Conditions for model identification are described and marginal maximum likelihood estimators are presented, along with simulation results demonstrating parameter recovery. To illustrate how MLTM-D can be used for diagnosis, an application to a large-scale test of mathematics achievement is presented. An advantage of MLTM-D for diagnosis is that it may be more applicable to large-scale assessments with heterogeneous items than are latent class models.
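To make the term "noncompensatory" concrete, the MLTM family that MLTM-D generalizes posits that an item is solved only if every required component is executed successfully, so component probabilities multiply rather than add; schematically,
\[
P(X_{ij}=1\mid\boldsymbol\theta_i)=\prod_{k}\left[\frac{\exp(\theta_{ik}-b_{jk})}{1+\exp(\theta_{ik}-b_{jk})}\right]^{q_{jk}},
\]
where \(q_{jk}\) indicates whether component \(k\) is involved in item \(j\). This is only a schematic form; MLTM-D adds the component-attribute hierarchy described above.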
This commentary presents some additional alternatives to the suggestions made by Reise et al. (2021). IRT models as they are used for patient-reported outcome (PRO) scales may not be fully satisfactory under commonly made assumptions. The suggested change to an alternative parameterization is critically examined, with the intent of initiating discussion around more comprehensive alternatives that allow for more complex latent structures and thus have the potential to be more appropriate for PRO scales applied to diverse populations.
We introduce an extended multivariate logistic response model for multiple choice items; this model includes several earlier proposals as special cases. The discussion includes a theoretical development of the model, a description of the relationship between the model and data, and a marginal maximum likelihood estimation scheme for the item parameters. Comparisons of the performance of different versions of the full model with more constrained forms corresponding to previous proposals are included, using likelihood ratio statistics and empirical data.
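As orientation (this is the generic multivariate logistic form, not necessarily the paper's exact parameterization), such multiple-choice models assign each response category its own slope and intercept:
\[
P(X_{ij}=k\mid\theta_i)=\frac{\exp(a_{jk}\theta_i+c_{jk})}{\sum_{h=1}^{m_j}\exp(a_{jh}\theta_i+c_{jh})},\qquad k=1,\dots,m_j,
\]
with identification constraints such as \(\sum_h a_{jh}=\sum_h c_{jh}=0\); the earlier proposals mentioned above correspond to constrained versions of the category parameters.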
An application of a hierarchical IRT model for items in families generated through the application of different combinations of design rules is discussed. Within the families, the items are assumed to differ only in surface features. The parameters of the model are estimated in a Bayesian framework, using a data-augmented Gibbs sampler. An obvious application of the model is computerized algorithmic item generation. Such algorithms have the potential to increase the cost-effectiveness of item generation as well as the flexibility of item administration. The model is applied to data from a non-verbal intelligence test created using design rules. In addition, results from a simulation study conducted to evaluate parameter recovery are presented.
This article considers the application of the simulation-extrapolation (SIMEX) method for measurement error correction when the error variance is a function of the latent variable being measured. Heteroskedasticity of this form arises in educational and psychological applications with ability estimates from item response theory models. We conclude that there is no simple solution for applying SIMEX that generally will yield consistent estimators in this setting. However, we demonstrate that several approximate SIMEX methods can provide useful estimators, leading to recommendations for analysts dealing with this form of error in settings where SIMEX may be the most practical option.
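For readers unfamiliar with SIMEX, the following is a minimal sketch of the basic (homoskedastic) recipe that the article builds on; variable names and settings are illustrative, and the heteroskedastic adaptations discussed in the article require modifying the simulation step.

```python
# Minimal SIMEX sketch for a regression slope attenuated by measurement
# error in the predictor: w = x + u, with Var(u) = sigma_u**2 assumed known.
import numpy as np

def simex_slope(w, y, sigma_u, lambdas=(0.5, 1.0, 1.5, 2.0), B=200, seed=0):
    rng = np.random.default_rng(seed)
    lam_grid, estimates = [0.0], [np.polyfit(w, y, 1)[0]]   # lambda = 0: naive fit
    for lam in lambdas:
        fits = []
        for _ in range(B):
            # simulation step: add extra error with variance lam * sigma_u**2
            w_star = w + rng.normal(scale=sigma_u * np.sqrt(lam), size=w.shape)
            fits.append(np.polyfit(w_star, y, 1)[0])
        lam_grid.append(lam)
        estimates.append(np.mean(fits))
    # extrapolation step: fit a quadratic in lambda and evaluate at lambda = -1
    return np.polyval(np.polyfit(lam_grid, estimates, 2), -1.0)
```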
Test items are often evaluated and compared by contrasting the shapes of their item characteristic curves (ICCs) or surfaces. The current paper develops and applies three general (i.e., nonparametric) comparisons of the shapes of two item characteristic surfaces: (i) proportional latent odds, (ii) uniform relative difficulty, and (iii) item sensitivity. Two items may be compared in these ways while making no assumption about the shapes of the item characteristic surfaces of other items and no assumption about the dimensionality of the latent variable. Also studied is a method for comparing the relative shapes of two item characteristic curves in two examinee populations.
The aim of the research presented here is the use of extensions of longitudinal item response theory (IRT) models in the analysis and comparison of group-specific growth in large-scale assessments of educational outcomes.
A general discrete latent variable model was used to specify and compare two types of multidimensional item-response-theory (MIRT) models for longitudinal data: (a) a model that handles repeated measurements as multiple, correlated variables over time and (b) a model that assumes one common variable over time and additional variables that quantify the change. Using extensions of these MIRT models, we approach the issue of modeling and comparing group-specific growth in observed and unobserved subpopulations. The analyses presented in this paper aim at answering the question whether academic growth is homogeneous across types of schools defined by academic demands and curricular differences. In order to facilitate answering this research question, (a) a model with a single two-dimensional ability distribution was compared to (b) a model assuming multiple populations with potentially different two-dimensional ability distributions based on type of school and to (c) a model that assumes that the observations are sampled from a discrete mixture of (unobserved) populations, allowing for differences across schools with respect to mixing proportions. For this purpose, we specified a hierarchical-mixture distribution variant of the two MIRT models. The latter model, (c), is a growth-mixture MIRT model that allows for variation of the mixing proportions across clusters in a hierarchically organized sample. We applied the proposed models to the PISA-I-Plus data for assessing learning and change across multiple subpopulations. The results of this study support the hypothesis of differential growth.
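In schematic Rasch-type notation (a sketch of the two parameterizations, not the exact models fitted to the PISA-I-Plus data), the contrast between (a) and (b) for two time points is
\[
\text{(a)}\ \ \operatorname{logit}P(X_{ijt}=1)=\theta_{it}-b_j,
\qquad
\text{(b)}\ \ \operatorname{logit}P(X_{ijt}=1)=\theta_{i}+\delta_i\,\mathbb{1}\{t=2\}-b_j,
\]
where the \(\theta_{it}\) in (a) are correlated time-specific abilities, while in (b) a single ability \(\theta_i\) is combined with a change dimension \(\delta_i\).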
Assuming a nonparametric family of item response theory models, a theory-based procedure for testing the hypothesis of unidimensionality of the latent space is proposed. The asymptotic distribution of the test statistic is derived assuming unidimensionality, thereby establishing an asymptotically valid statistical test of the unidimensionality of the latent trait. Based upon a new notion of dimensionality, the test is shown to have asymptotic power 1. A 6300-trial Monte Carlo study using published item parameter estimates of widely used standardized tests indicates conservative adherence to the nominal level of significance and statistical power averaging 81 rejections out of 100 for examinee sample sizes and psychological test lengths often encountered in practice.
When surveys contain direct questions about sensitive topics, participants may not provide their true answers. Indirect question techniques incentivize truthful answers by concealing participants’ responses in various ways. The Crosswise Model aims to do this by pairing a sensitive target item with a non-sensitive baseline item, and only asking participants to indicate whether their responses to the two items are the same or different. Selection of the baseline item is crucial to guarantee participants’ perceived and actual privacy and to enable reliable estimates of the sensitive trait. This research makes the following contributions. First, it describes an integrated methodology to select the baseline item, based on conceptual and statistical considerations. The resulting methodology distinguishes four statistical models. Second, it proposes novel Bayesian estimation methods to implement these models. Third, it shows that the new models introduced here improve efficiency over common applications of the Crosswise Model and may relax the required statistical assumptions. These three contributions facilitate applying the methodology in a variety of settings. An empirical application on attitudes toward LGBT issues shows the potential of the Crosswise Model. An interactive app, Python and MATLAB codes support broader adoption of the model.
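For readers new to the technique, the basic crosswise estimator (the simplest design, prior to the Bayesian extensions proposed here) follows from one line of algebra: with sensitive-trait prevalence \(\pi\) and a baseline item whose "yes" probability \(p\neq 1/2\) is known,
\[
\Pr(\text{``same''})=\lambda=\pi p+(1-\pi)(1-p)
\quad\Longrightarrow\quad
\hat\pi=\frac{\hat\lambda+p-1}{2p-1},
\]
where \(\hat\lambda\) is the observed proportion of "same" answers; the reliability of \(p\) is exactly why baseline-item selection matters.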
Change-point analysis (CPA) is a well-established statistical method for detecting abrupt changes, if any, in a sequence of data. In this paper, we propose a procedure based on CPA to detect test speededness. The procedure not only classifies examinees into speeded and non-speeded groups, but also identifies the point at which an examinee starts to speed. Identifying the change point can be very useful. First, it informs decision makers of the appropriate length of a test. Second, by removing only the speeded responses, instead of the entire response sequence of an examinee suspected of speededness, ability estimation can be improved. Simulation studies show that the procedure is efficient in detecting both speeded examinees and the speeding point. Ability estimation is dramatically improved by removing speeded responses identified by the procedure. The procedure is then applied to a real dataset for illustration purposes.
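A minimal CUSUM-style sketch of the kind of change-point scan such a procedure performs for a single examinee; the statistic and decision rule used in the paper may differ, and the names below are illustrative.

```python
# Scan one examinee's residuals (observed minus model-expected correctness)
# for the point after which performance drops, as would happen under speeding.
import numpy as np

def speededness_change_point(x, p_expected):
    """x: 0/1 responses in administration order; p_expected: probabilities of a
    correct response under the fitted IRT model if the examinee were not speeded."""
    resid = np.asarray(x) - np.asarray(p_expected)
    n = len(resid)
    stats = []
    for k in range(1, n):                               # candidate change points
        drop = resid[:k].mean() - resid[k:].mean()      # pre- vs post-change residual mean
        stats.append(np.sqrt(k * (n - k) / n) * drop)   # weighted mean difference
    k_hat = int(np.argmax(stats)) + 1                   # estimated speeding onset
    return k_hat, float(np.max(stats))
```

A large value of the statistic, judged against a simulated null distribution, flags the examinee as speeded, and responses after k_hat can be dropped before re-estimating ability.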
We propose a nonparametric item response theory model for dichotomously scored items in a Bayesian framework. The model is based on a latent class (LC) formulation, and it is multidimensional, with dimensions corresponding to a partition of the items into homogeneous groups specified on the basis of inequality constraints among the conditional success probabilities given the latent class. Moreover, an innovative system of prior distributions is proposed following the encompassing approach, in which the largest model is the unconstrained LC model. A reversible-jump type algorithm is described for sampling from the joint posterior distribution of the model parameters of the encompassing model. By suitably post-processing its output, we make inference on the number of dimensions (i.e., the number of groups of items measuring the same latent trait) and cluster items according to the dimensions when unidimensionality is violated. The approach is illustrated by two examples on simulated data and two applications based on educational and quality-of-life data.
Given known item parameters, the bootstrap method can be used to determine the statistical accuracy of ability estimates in item response theory. Through a Monte Carlo study, the method is evaluated as a way of approximating the standard error and confidence limits for the maximum likelihood estimate of the ability parameter, and compared to the use of the theoretical standard error and confidence limits developed by Lord. At least for short tests, the bootstrap method yielded better estimates than the corresponding theoretical values.
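A sketch of the parametric bootstrap in this setting, assuming a 2PL response function for concreteness (the names and settings below are illustrative, not the study's code): simulate response vectors from the estimated ability and the known item parameters, re-estimate the ability for each replicate, and take the standard deviation across replicates.

```python
# Parametric bootstrap standard error for the ability MLE with known item
# parameters; a 2PL item response function is assumed for this sketch.
import numpy as np
from scipy.optimize import minimize_scalar

def mle_theta(x, a, b):
    """Ability MLE for a 0/1 response vector x under a 2PL with known a, b."""
    def neg_loglik(theta):
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return minimize_scalar(neg_loglik, bounds=(-6, 6), method="bounded").x

def bootstrap_se(theta_hat, a, b, B=1000, seed=0):
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))      # response probabilities at theta_hat
    reps = [mle_theta(rng.binomial(1, p), a, b) for _ in range(B)]
    return float(np.std(reps, ddof=1))                  # bootstrap standard error
```

Percentile confidence limits follow directly from the empirical quantiles of the bootstrap replicates in reps.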
Standard procedures for drawing inferences from complex samples do not apply when the variable of interest θ cannot be observed directly, but must be inferred from the values of secondary random variables that depend on θ stochastically. Examples are proficiency variables in item response models and class memberships in latent class models. Rubin's “multiple imputation” techniques yield approximations of sample statistics that would have been obtained, had θ been observable, and associated variance estimates that account for uncertainty due to both the sampling of respondents and the latent nature of θ. The approach is illustrated with data from the National Assessment of Educational Progress.
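Concretely, the combining rules behind this approach are Rubin's: with \(M\) plausible values (imputations) of \(\theta\), a statistic \(t\) is estimated by
\[
\bar t=\frac{1}{M}\sum_{m=1}^{M}\hat t^{(m)},\qquad
V=\bar U+\Bigl(1+\frac{1}{M}\Bigr)B,\qquad
\bar U=\frac{1}{M}\sum_m U^{(m)},\quad
B=\frac{1}{M-1}\sum_m\bigl(\hat t^{(m)}-\bar t\bigr)^2,
\]
where \(U^{(m)}\) is the complete-data sampling variance of \(\hat t^{(m)}\); \(\bar U\) captures uncertainty from sampling respondents and \(B\) the uncertainty due to the latent nature of \(\theta\).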
A formal framework for measuring change in sets of dichotomous data is developed and implications of the principle of specific objectivity of results within this framework are investigated. Building upon the concept of specific objectivity as introduced by G. Rasch, three equivalent formal definitions of that postulate are given, and it is shown that they lead to latent additivity of the parametric structure. If, in addition, the observations are assumed to be locally independent realizations of Bernoulli variables, a family of models follows necessarily which are isomorphic to a logistic model with additive parameters, determining an interval scale for latent trait measurement and a ratio scale for quantifying change. Adding the further assumption of generalizability over subsets of items from a given universe yields a logistic model which allows a multidimensional description of individual differences and a quantitative assessment of treatment effects; as a special case, a unidimensional parameterization is introduced also and a unidimensional latent trait model for change is derived. As a side result, the relationship between specific objectivity and additive conjoint measurement is clarified.