Differential item functioning (DIF) analysis is an important step in establishing the validity of measurements. Most traditional methods for DIF analysis use an item-by-item strategy via anchor items that are assumed DIF-free. If the anchor items are flawed, these methods yield misleading results due to biased scales. In this article, based on the fact that an item’s relative change of difficulty difference (RCD) does not depend on the mean ability of the individual groups, a new DIF detection method (RCD-DIF) is proposed that compares the observed differences against those from simulated data known to be DIF-free. The RCD-DIF method consists of a D-QQ (quantile-quantile) plot that permits the identification of internal reference points (similar to anchor items), an RCD-QQ plot that facilitates visual examination of DIF, and an RCD graphical test that synchronizes DIF analysis at the test level with that at the item level via confidence intervals on individual items. The RCD procedure visually reveals the overall pattern of DIF in the test and the size of DIF for each item, and is expected to work properly even when the majority of the items possess DIF and the DIF pattern is unbalanced. Results of two simulation studies indicate that the RCD graphical test has Type I error rates comparable to those of existing methods but greater power.
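As a rough illustration of the comparison logic only (not the authors' implementation), the sketch below contrasts centred between-group difficulty differences with differences obtained from data simulated to be DIF-free. The proportion-correct logit used as a difficulty proxy, the reuse of the true generating parameters for the simulated reference data, and all numeric values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def logit_difficulty(responses):
    """Crude difficulty proxy: minus the logit of each item's proportion correct."""
    p = responses.mean(axis=0).clip(0.01, 0.99)
    return -np.log(p / (1 - p))

def simulate_dif_free(n_persons, difficulties, ability_mean=0.0):
    """Simulate Rasch-type responses with no DIF for one group."""
    theta = rng.normal(ability_mean, 1.0, size=(n_persons, 1))
    prob = 1 / (1 + np.exp(-(theta - difficulties)))
    return (rng.uniform(size=prob.shape) < prob).astype(int)

# Hypothetical "observed" data: reference and focal groups, 20 items,
# 5 of which carry DIF, plus a mean-ability difference between groups.
true_b = rng.normal(0, 1, 20)
ref = simulate_dif_free(1000, true_b, ability_mean=0.0)
foc = simulate_dif_free(1000, true_b + np.r_[0.5 * np.ones(5), np.zeros(15)],
                        ability_mean=-0.3)

# Centred difficulty differences: the common shift due to group mean ability
# cancels, which is the "relative change" idea.
obs_diff = logit_difficulty(foc) - logit_difficulty(ref)
obs_rcd = np.sort(obs_diff - np.median(obs_diff))

# Reference distribution from data simulated to be DIF-free with the same
# group mean-ability difference.
sim_ref = simulate_dif_free(1000, true_b, ability_mean=0.0)
sim_foc = simulate_dif_free(1000, true_b, ability_mean=-0.3)
sim_diff = logit_difficulty(sim_foc) - logit_difficulty(sim_ref)
sim_rcd = np.sort(sim_diff - np.median(sim_diff))

# QQ-style comparison: items whose centred difference departs clearly from
# the simulated DIF-free quantiles are DIF candidates.
for q_obs, q_sim in zip(obs_rcd, sim_rcd):
    print(f"observed {q_obs:+.2f}   simulated DIF-free {q_sim:+.2f}")
```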
Several methods used to examine differential item functioning (DIF) in Patient-Reported Outcomes Measurement Information System (PROMIS®) measures are presented, including effect size estimation. A summary of factors that may affect DIF detection and challenges encountered in PROMIS DIF analyses, e.g., anchor item selection, is provided. An issue in PROMIS was the potential for inadequately modeled multidimensionality to result in false DIF detection. Section 1 is a presentation of the unidimensional models used by most PROMIS investigators for DIF detection, as well as their multidimensional expansions. Section 2 is an illustration that builds on previous unidimensional analyses of depression and anxiety short-forms to examine DIF detection using a multidimensional item response theory (MIRT) model. The Item Response Theory-Log-likelihood Ratio Test (IRT-LRT) method was used for a real data illustration with gender as the grouping variable. The IRT-LRT DIF detection method is a flexible approach to handle group differences in trait distributions, known as impact in the DIF literature, and was studied with both real data and in simulations to compare the performance of the IRT-LRT method within the unidimensional IRT (UIRT) and MIRT contexts. Additionally, different effect size measures were compared for the data presented in Section 2. A finding from the real data illustration was that using the IRT-LRT method within a MIRT context resulted in more flagged items as compared to using the IRT-LRT method within a UIRT context. The simulations provided some evidence that while unidimensional and multidimensional approaches were similar in terms of Type I error rates, power for DIF detection was greater for the multidimensional approach. Effect size measures presented in Section 1 and applied in Section 2 varied in terms of estimation methods, choice of density function, methods of equating, and anchor item selection. Despite these differences, there was considerable consistency in results, especially for the items showing the largest values. Future work is needed to examine DIF detection in the context of polytomous, multidimensional data. PROMIS standards included incorporation of effect size measures in determining salient DIF. Integrated methods for examining effect size measures in the context of IRT-based DIF detection procedures are still in early stages of development.
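For readers unfamiliar with the mechanics, the fragment below sketches only the generic likelihood-ratio step behind IRT-LRT DIF testing, assuming the constrained and freed models have already been fitted elsewhere; the log-likelihood values and the number of freed parameters are hypothetical.

```python
from scipy.stats import chi2

def irt_lrt(loglik_constrained, loglik_free, n_freed_params):
    """LRT statistic and p-value for one studied item: compare the model that
    constrains the item's parameters to be equal across groups with the model
    that frees them."""
    stat = 2 * (loglik_free - loglik_constrained)
    pval = chi2.sf(stat, df=n_freed_params)
    return stat, pval

# Hypothetical log-likelihoods, e.g., freeing discrimination and difficulty.
stat, pval = irt_lrt(loglik_constrained=-10234.7, loglik_free=-10228.1,
                     n_freed_params=2)
print(f"G2 = {stat:.2f}, p = {pval:.4f}")
```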
The multiple-group categorical factor analysis (FA) model and the graded response model (GRM) are commonly used to examine polytomous items for differential item functioning to detect possible measurement bias in educational testing. In this study, the multiple-group categorical factor analysis model (MC-FA) and the multiple-group normal-ogive GRM are unified under the common framework of discretization of a normal variate. We rigorously justify a set of identified parameters and determine possible identifiability constraints necessary to make the parameters just-identified and estimable in the common framework of MC-FA. By doing so, the difference between the categorical FA model and the normal-ogive GRM reduces to the use of two different sets of identifiability constraints, rather than a substantive distinction between categorical FA and GRM. Thus, we compare the performance on DIF assessment between the categorical FA and GRM approaches through simulation studies on MC-FA models with their corresponding sets of identifiability constraints. Our results show that, under scenarios with varying degrees of DIF for examinees of different ability levels, models with the GRM type of identifiability constraints generally perform better on DIF detection, with higher power. General guidelines regarding the choice of just-identified parameterization are also provided for practical use.
The paper surveys 15 years of progress in three psychometric research areas: latent dimensionality structure, test fairness, and skills diagnosis of educational tests. It is proposed that one effective model for selecting and carrying out research is to choose one's research questions from practical challenges facing educational testing, then bring to bear sophisticated probability modeling and statistical analyses to solve these questions, and finally to make the effectiveness of the research answers in meeting the educational testing challenges the ultimate criterion for judging the value of the research. The problem-solving power and the joy of working with a dedicated, focused, and collegial group of colleagues are emphasized. Finally, it is suggested that the summative assessment testing paradigm that has driven test measurement research for over half a century is giving way to a new paradigm that in addition embraces skills-level formative assessment, opening up a plethora of challenging, exciting, and societally important research problems for psychometricians.
Measurement invariance is a fundamental assumption in item response theory models, where the relationship between a latent construct (ability) and observed item responses is of interest. Violation of this assumption would render the scale misinterpreted or cause systematic bias against certain groups of persons. While a number of methods have been proposed to detect measurement invariance violations, they typically require advance definition of problematic item parameters and respondent grouping information. However, these pieces of information are typically unknown in practice. As an alternative, this paper focuses on a family of recently proposed tests based on stochastic processes of casewise derivatives of the likelihood function (i.e., scores). These score-based tests only require estimation of the null model (when measurement invariance is assumed to hold), and they have been previously applied in factor-analytic, continuous data contexts as well as in models of the Rasch family. In this paper, we aim to extend these tests to two-parameter item response models, with strong emphasis on pairwise maximum likelihood. The tests’ theoretical background and implementation are detailed, and the tests’ abilities to identify problematic item parameters are studied via simulation. An empirical example illustrating the tests’ use in practice is also provided.
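A minimal sketch of the kind of statistic such score-based tests use, assuming casewise score contributions from the fitted null model are already available. The random numbers below merely stand in for those contributions, and in practice the statistic would be compared to the relevant asymptotic or simulated critical values rather than printed.

```python
import numpy as np

def double_max_statistic(scores, ordering):
    """scores: (n, k) casewise score contributions evaluated at the null fit;
    ordering: covariate used to sort cases (e.g., age)."""
    n, k = scores.shape
    s = scores[np.argsort(ordering)]           # sort cases by the covariate
    info = s.T @ s / n                          # empirical information proxy
    root_inv = np.linalg.inv(np.linalg.cholesky(info)).T
    process = np.cumsum(s, axis=0) @ root_inv / np.sqrt(n)   # decorrelated CUSUM process
    return np.abs(process).max()                # maximum fluctuation over cases and parameters

# Hypothetical stand-in data: under the null the fluctuation stays moderate.
rng = np.random.default_rng(1)
scores = rng.normal(size=(500, 3))
stat = double_max_statistic(scores, ordering=rng.uniform(size=500))
print(f"double-max statistic: {stat:.2f}")
```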
In response to the target article by Teresi et al. (2021), we explain why the article is useful and we also present a different approach. An alternative category of differential item functioning (DIF) is presented with a corresponding way of modeling DIF, based on random person and random item effects and explanatory covariates.
The PARELLA model is a probabilistic parallelogram model that can be used for the measurement of latent attitudes or latent preferences. The data analyzed are the dichotomous responses of persons to items, with a one (zero) indicating agreement (disagreement) with the content of the item. The model provides a unidimensional representation of persons and items. The response probabilities are a function of the distance between person and item: the smaller the distance, the larger the probability that a person will agree with the content of the item. This paper discusses how the approach to differential item functioning presented by Thissen, Steinberg, and Wainer can be implemented for the PARELLA model.
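The PARELLA response function is commonly written as P(X = 1 | theta, delta, gamma) = 1 / (1 + |theta - delta|^(2*gamma)), so that agreement is most likely when the person-item distance is small. The snippet below evaluates this commonly cited form with hypothetical parameter values; it is an illustration of the distance mechanism, not of the paper's DIF procedure.

```python
import numpy as np

def parella_prob(theta, delta, gamma):
    """Probability of agreement as a decreasing function of person-item distance."""
    return 1.0 / (1.0 + np.abs(theta - delta) ** (2.0 * gamma))

theta = np.linspace(-3, 3, 7)          # person locations
print(parella_prob(theta, delta=0.5, gamma=1.5))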
In the behavioural sciences, local dependence and DIF are common, and purification procedures that eliminate items with these weaknesses often result in short scales with poor reliability. Graphical loglinear Rasch models (Kreiner & Christensen, in Statistical Methods for Quality of Life Studies, ed. by M. Mesbah, F.C. Cole & M.T. Lee, Kluwer Academic, pp. 187–203, 2002), in which uniform DIF and uniform local dependence are permitted, solve this dilemma by modelling the local dependence and DIF. Identifying loglinear Rasch models by a stepwise model search is often very time consuming, since the initial item analysis may disclose a great deal of spurious and misleading evidence of DIF and local dependence that has to be disposed of during the modelling procedure.
Like graphical models, graphical loglinear Rasch models possess Markov properties that are useful during the statistical analysis if they are used methodically. This paper describes how. It contains a systematic study of the Markov properties and the way they can be used to distinguish spurious from genuine evidence of DIF and local dependence and proposes a strategy for initial item screening that will reduce the time needed to identify a graphical loglinear Rasch model that fits the item responses. The last part of the paper illustrates the item screening procedure on simulated data and on data on the PF subscale measuring physical functioning in the SF36 Health Survey inventory.
A new diagnostic tool for the identification of differential item functioning (DIF) is proposed. Classical approaches to DIF consider only a few subpopulations, such as ethnic groups, when investigating whether the solution of items depends on membership in a subpopulation. We propose an explicit model for differential item functioning that includes a set of variables, containing metric as well as categorical components, as potential candidates for inducing DIF. The ability to include a set of covariates entails that the model contains a large number of parameters. Regularized estimators, in particular penalized maximum likelihood estimators, are used to solve the estimation problem and to identify the items that induce DIF. It is shown that the method is able to detect items with DIF. Simulations and two applications demonstrate the applicability of the method.
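As a loose analogue of the regularized estimation idea (not the authors' penalized maximum likelihood estimator), the sketch below fits, per item, an L1-penalized logistic regression of the response on a rest-score ability proxy plus the candidate DIF covariates; covariate coefficients that survive the penalty flag DIF candidates. The data, the rest-score proxy, and the tuning value are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, J = 1000, 10
theta = rng.normal(size=n)
X = np.column_stack([rng.normal(size=n),              # metric covariate (e.g., age)
                     rng.integers(0, 2, size=n)])     # categorical covariate (e.g., group)
b = rng.normal(0, 1, J)
dif = np.zeros((J, 2))
dif[0] = [0.0, 0.8]                                   # item 0 carries DIF on the group covariate
eta = theta[:, None] - (b + X @ dif.T)
Y = (rng.uniform(size=(n, J)) < 1 / (1 + np.exp(-eta))).astype(int)

for j in range(J):
    rest_score = Y.sum(axis=1) - Y[:, j]              # crude ability proxy
    design = np.column_stack([rest_score, X])
    fit = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(design, Y[:, j])
    dif_coefs = fit.coef_[0][1:]                      # covariate effects only
    if np.any(np.abs(dif_coefs) > 1e-6):
        print(f"item {j}: candidate DIF, covariate effects {np.round(dif_coefs, 2)}")
```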
This paper proposes a model-based family of detection and quantification statistics to evaluate response bias in item bundles of any size. Compensatory (CDRF) and non-compensatory (NCDRF) response bias measures are proposed, along with their sample realizations and large-sample variability when models are fitted using multiple-group estimation. Based on the underlying connection to item response theory estimation methodology, it is argued that these new statistics provide a powerful and flexible approach to studying response bias for categorical response data over and above methods that have previously appeared in the literature. To evaluate their practical utility, CDRF and NCDRF are compared to the closely related SIBTEST family of statistics and to likelihood-based detection methods through a series of Monte Carlo simulations. Results indicate that the new statistics are better effect size estimates of marginal response bias than the SIBTEST family, are competitive with a selection of likelihood-based methods when studying item-level bias, and perform best when studying differential bundle and test bias.
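The sketch below illustrates only the underlying idea of signed versus unsigned response bias for an item bundle: integrate the difference between group-specific expected-score functions over a focal-group density. It is not the paper's estimator or its sampling variance, and the 2PL parameter values are hypothetical.

```python
import numpy as np

def expected_score(theta, a, b):
    """Expected bundle score under 2PL items with slopes a and difficulties b."""
    return (1 / (1 + np.exp(-a * (theta[:, None] - b)))).sum(axis=1)

theta = np.linspace(-6, 6, 601)
dtheta = theta[1] - theta[0]
w = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)      # N(0,1) focal density
w /= (w * dtheta).sum()

a_ref, b_ref = np.array([1.2, 0.8, 1.0]), np.array([-0.5, 0.0, 0.7])
a_foc, b_foc = a_ref, b_ref + np.array([0.4, 0.0, -0.1])   # focal-group drift

diff = expected_score(theta, a_ref, b_ref) - expected_score(theta, a_foc, b_foc)
cdrf = (diff * w * dtheta).sum()                       # signed (compensatory) bias
ncdrf = (np.abs(diff) * w * dtheta).sum()              # unsigned (non-compensatory) bias
print(f"CDRF ~ {cdrf:.3f}, NCDRF ~ {ncdrf:.3f}")
```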
We present a class of finite mixture multilevel multidimensional ordinal IRT models for large scale cross-cultural research. Our model is proposed for confirmatory research settings. Our prior for item parameters is a mixture distribution to accommodate situations where different groups of countries have different measurement operations, while countries within these groups are still allowed to be heterogeneous. A simulation study is conducted that shows that all parameters can be recovered. We also apply the model to real data on the two components of affective subjective well-being: positive affect and negative affect. The psychometric behavior of these two scales is studied in 28 countries across four continents.
Establishing the invariance property of an instrument (e.g., a questionnaire or test) is a key step for establishing its measurement validity. Measurement invariance is typically assessed by differential item functioning (DIF) analysis, i.e., detecting DIF items whose response distribution depends not only on the latent trait measured by the instrument but also on the group membership. DIF analysis is confounded by the group difference in the latent trait distributions. Many DIF analyses require knowing several anchor items that are DIF-free in order to draw inferences on whether each of the rest is a DIF item, where the anchor items are used to identify the latent trait distributions. When no prior information on anchor items is available, or some anchor items are misspecified, item purification methods and regularized estimation methods can be used. The former iteratively purifies the anchor set by a stepwise model selection procedure, and the latter selects the DIF-free items by a LASSO-type regularization approach. Unfortunately, unlike the methods based on a correctly specified anchor set, these methods are not guaranteed to provide valid statistical inference (e.g., confidence intervals and p-values). In this paper, we propose a new method for DIF analysis under a multiple indicators and multiple causes (MIMIC) model for DIF. This method adopts a minimal L1 norm condition for identifying the latent trait distributions. Without requiring prior knowledge about an anchor set, it can accurately estimate the DIF effects of individual items and further draw valid statistical inferences for quantifying the uncertainty. Specifically, the inference results allow us to control the type-I error for DIF detection, which may not be possible with item purification and regularized estimation methods. We conduct simulation studies to evaluate the performance of the proposed method and compare it with the anchor-set-based likelihood ratio test approach and the LASSO approach. The proposed method is applied to analysing the three personality scales of the Eysenck personality questionnaire-revised (EPQ-R).
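A toy illustration of the minimal-L1-norm identification idea only (not the MIMIC estimator itself): raw item-by-group differences are identified only up to a common shift that trades off against the latent mean difference, and choosing the shift that minimizes the L1 norm of the DIF effects amounts to centring at the median, which leaves the (majority) DIF-free items with effects near zero. The numbers below are hypothetical.

```python
import numpy as np

# Hypothetical raw item-by-group difference estimates; most items share a
# common shift of about 0.3, two items carry genuine DIF.
raw_diff = np.array([0.31, 0.28, 0.33, 0.30, 0.29, 0.95, 0.27, -0.40])

shift = np.median(raw_diff)           # minimizes sum(|raw_diff - shift|)
dif_effects = raw_diff - shift        # near zero for DIF-free items
impact = shift                        # absorbed into the latent mean difference
print(np.round(dif_effects, 2), f"impact ~ {impact:.2f}")
```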
This paper addresses methodological issues that concern the scaling model used in the international comparison of student attainment in the Programme for International Student Assessment (PISA), specifically with reference to whether PISA’s ranking of countries is confounded by model misfit and differential item functioning (DIF). To determine this, we reanalyzed the publicly accessible data on reading skills from the 2006 PISA survey. We also examined whether the ranking of countries is robust to the errors of the scaling model. This was done by studying invariance across subscales, and by comparing ranks based on the scaling model with ranks based on models in which some of the flaws of PISA’s scaling model are taken into account. Our analyses provide strong evidence of misfit of the PISA scaling model and very strong evidence of DIF. These findings do not support the claims that the country rankings reported by PISA are robust.
This paper proposes a method for assessing differential item functioning (DIF) in item response theory (IRT) models. The method does not require pre-specification of anchor items, which is its main virtue. It is developed in two main steps: first by showing how DIF can be re-formulated as a problem of outlier detection in IRT-based scaling and then tackling the latter using methods from robust statistics. The proposal is a redescending M-estimator of IRT scaling parameters that is tuned to flag items with DIF at the desired asymptotic type I error rate. Theoretical results describe the efficiency of the estimator in the absence of DIF and its robustness in the presence of DIF. Simulation studies show that the proposed method compares favorably to currently available approaches for DIF detection, and a real data example illustrates its application in a research context where pre-specification of anchor items is infeasible. The focus of the paper is the two-parameter logistic model in two independent groups, with extensions to other settings considered in the conclusion.
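To convey the redescending idea only (the paper derives its own tuned estimator), the sketch below uses Tukey's biweight as a generic redescending weight: a common scaling shift is estimated by iteratively reweighted least squares, and items whose standardized residuals fall beyond the tuning constant receive zero weight and are flagged as DIF suspects. All values are hypothetical.

```python
import numpy as np

def biweight_weight(r, c=4.685):
    """Weight function of Tukey's biweight: redescends to zero beyond |r| = c."""
    w = (1 - (r / c) ** 2) ** 2
    return np.where(np.abs(r) <= c, w, 0.0)

def robust_shift(d, se, c=4.685, n_iter=50):
    """d: item-wise between-group differences; se: their standard errors."""
    mu = np.median(d)
    for _ in range(n_iter):
        r = (d - mu) / se
        w = biweight_weight(r, c)
        mu = np.sum(w * d / se**2) / np.sum(w / se**2)
    return mu

# Hypothetical differences: most items share a common shift, two carry DIF.
d = np.array([0.10, 0.12, 0.08, 0.11, 0.09, 0.75, 0.10, -0.60])
se = np.full_like(d, 0.08)
mu = robust_shift(d, se)
flagged = np.abs((d - mu) / se) > 4.685        # zero-weight (redescended) items
print(f"shift ~ {mu:.2f}; flagged items: {np.where(flagged)[0]}")
```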
This paper presents a systematic investigation of how affirmative and polar-opposite items presented either jointly or separately affect yea-saying tendencies. We measure these yea-saying tendencies with item response models that estimate a respondent’s tendency to give a “yea”-response that may be unrelated to the target trait. In a re-analysis of the Zhang et al. (PLoS ONE, 11:1–15, 2016) data, we find that yea-saying tendencies depend on whether items are presented as part of a scale that contains affirmative and/or polar-opposite items. Yea-saying tendencies are stronger for affirmative than for polar-opposite items. Moreover, presenting polar-opposite items together with affirmative items creates lower yea-saying tendencies for polar-opposite items than when presented in isolation. IRT models that do not account for these yea-saying effects arrive at a two-dimensional representation of the target trait. These findings demonstrate that the contextual information provided by an item scale can serve as a determinant of differential item functioning.
Differential item functioning (DIF) is a standard analysis for every testing company. Research has demonstrated that DIF can result when test items measure different ability composites, and the groups being examined for DIF exhibit distinct underlying ability distributions on those composite abilities. In this article, we examine DIF from a two-dimensional multidimensional item response theory (MIRT) perspective. We begin by delving into the compensatory MIRT model, illustrating how items and the composites they measure can be graphically represented. Additionally, we discuss how estimated item parameters can vary based on the underlying latent ability distributions of the examinees. Analytical research highlighting the consequences of ignoring dimensionality and applying unidimensional IRT models, where the two-dimensional latent space is mapped onto a unidimensional scale, is reviewed. Next, we investigate three different approaches to understanding DIF from a MIRT standpoint: 1. Analytically derived uniform and nonuniform DIF: when the two groups of interest have different two-dimensional ability distributions and a unidimensional model is estimated. 2. Accounting for the complete latent ability space: we emphasize the importance of considering the entire latent ability space when using conditional DIF approaches, which mitigates DIF effects. 3. Scenario-based DIF: even when the underlying two-dimensional distributions are identical for the two groups, differing problem-solving approaches can still lead to DIF. Modern software programs facilitate routine DIF procedures for comparing response data from two identified groups of interest. The real challenge is to identify why DIF could occur with flagged items. Thus, as a closing challenge, we present four items (Appendix A) from a standardized test and invite readers to identify which group was favored by a DIF analysis.
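The compensatory two-dimensional (M2PL) response function that this discussion builds on is standard; the snippet below evaluates it with hypothetical parameter values to show the compensatory property that equal composites a1*theta1 + a2*theta2 yield equal response probabilities.

```python
import numpy as np

def m2pl(theta1, theta2, a1, a2, d):
    """Compensatory two-dimensional 2PL: P = 1 / (1 + exp(-(a1*t1 + a2*t2 + d)))."""
    return 1 / (1 + np.exp(-(a1 * theta1 + a2 * theta2 + d)))

# Two persons with the same composite (1.2*1.0 + 0.6*(-1.0) = 1.2*0.5 + 0.6*0.0)
# receive the same probability: a deficit on one dimension is compensated by the other.
print(m2pl(1.0, -1.0, a1=1.2, a2=0.6, d=0.0))
print(m2pl(0.5, 0.0, a1=1.2, a2=0.6, d=0.0))
```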
Item response theory (IRT) plays an important role in psychological and educational measurement. Unlike classical test theory, IRT models aggregate the item-level information, yielding more accurate measurements. Most IRT models assume local independence, an assumption not likely to be satisfied in practice, especially when the number of items is large. Results in the literature and simulation studies in this paper reveal that misspecifying the local independence assumption may result in inaccurate measurements and differential item functioning. To provide more robust measurements, we propose an integrated approach that adds a graphical component to a multidimensional IRT model to offset the effect of unknown local dependence. The new model contains a confirmatory latent variable component, which measures the targeted latent traits, and a graphical component, which captures the local dependence. An efficient proximal algorithm is proposed for the parameter estimation and structure learning of the local dependence. This approach can substantially improve the measurement, given no prior information on the local dependence structure. The model can be applied to measure both a unidimensional latent trait and multidimensional latent traits.
This paper discusses the issue of differential item functioning (DIF) in international surveys. DIF is likely to occur in international surveys. What is needed is a statistical approach that takes DIF into account, while at the same time allowing for meaningful comparisons between countries. Some existing approaches are discussed and an alternative is provided. The core of this alternative approach is to define the construct as a large set of items, and to report in terms of summary statistics. Since the data are incomplete, measurement models are used to complete the incomplete data. For that purpose, different models can be used across countries. The method is illustrated with PISA’s reading literacy data. The results indicate that this approach fits the data better than the current PISA methodology; however, the league tables are nearly identical. The implications for monitoring changes over time are discussed.
In multidimensional tests, the identification of latent traits measured by each item is crucial. In addition to item–trait relationship, differential item functioning (DIF) is routinely evaluated to ensure valid comparison among different groups. The two problems are investigated separately in the literature. This paper uses a unified framework for detecting item–trait relationship and DIF in multidimensional item response theory (MIRT) models. By incorporating DIF effects in MIRT models, these problems can be considered as variable selection for latent/observed variables and their interactions. A Bayesian adaptive Lasso procedure is developed for variable selection, in which item–trait relationship and DIF effects can be obtained simultaneously. Simulation studies show the performance of our method for parameter estimation, the recovery of item–trait relationship and the detection of DIF effects. An application is presented using data from the Eysenck Personality Questionnaire.
Differential item functioning (DIF), referring to between-group variation in item characteristics above and beyond the group-level disparity in the latent variable of interest, has long been regarded as an important item-level diagnostic. The presence of DIF impairs the fit of the single-group item response model being used, and calls for either model modification or item deletion in practice, depending on the mode of analysis. Methods for testing DIF with continuous covariates, rather than categorical grouping variables, have been developed; however, they are restricted to parametric forms, and thus are not sufficiently flexible to describe complex interactions among latent variables and covariates. In the current study, we formulate the probability of endorsing each test item as a general bivariate function of a unidimensional latent trait and a single covariate, which is then approximated by a two-dimensional smoothing spline. The accuracy and precision of the proposed procedure are evaluated via Monte Carlo simulations. If anchor items are available, we propose an extended model that simultaneously estimates item characteristic functions (ICFs) for anchor items, ICFs conditional on the covariate for non-anchor items, and the latent variable density conditional on the covariate, all using regression splines. A permutation DIF test is developed, and its performance is compared to the conventional parametric approach in a simulation study. We also illustrate the proposed semiparametric DIF testing procedure with an empirical example.
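A generic sketch of the permutation logic only: the paper's statistic is built from spline-estimated item characteristic functions, whereas the stand-in statistic below simply correlates the covariate with item residuals after coarse conditioning on an ability proxy. The data, the rest-score strata, and the statistic are illustrative assumptions; only the permute-and-recompute structure carries over.

```python
import numpy as np

rng = np.random.default_rng(3)

def permutation_pvalue(item, covariate, strata, n_perm=199):
    """Permute the covariate to build the null distribution of a DIF statistic."""
    resid = item - np.array([item[strata == s].mean() for s in strata])
    observed = abs(np.corrcoef(resid, covariate)[0, 1])
    null = [abs(np.corrcoef(resid, rng.permutation(covariate))[0, 1])
            for _ in range(n_perm)]
    return (1 + sum(s >= observed for s in null)) / (n_perm + 1)

# Hypothetical binary item, continuous covariate, coarse ability strata.
n = 400
ability = rng.normal(size=n)
covariate = rng.normal(size=n)
item = (rng.uniform(size=n) < 1 / (1 + np.exp(-(ability + 0.6 * covariate)))).astype(int)
strata = np.digitize(ability, [-1, 0, 1])      # stand-in for rest-score strata
print(f"permutation p-value: {permutation_pvalue(item, covariate, strata):.3f}")
```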