In item response theory (IRT), it is often necessary to perform restricted recalibration (RR) of the model: A set of (focal) parameters is estimated holding a set of (nuisance) parameters fixed. Typical applications of RR include expanding an existing item bank, linking multiple test forms, and associating constructs measured by separately calibrated tests. In the current work, we provide full statistical theory for RR of IRT models under the framework of pseudo-maximum likelihood estimation. We describe the standard error calculation for the focal parameters, the assessment of overall goodness-of-fit (GOF) of the model, and the identification of misfitting items. We report a simulation study to evaluate the performance of these methods in the scenario of adding a new item to an existing test. Parameter recovery for the focal parameters as well as Type I error and power of the proposed tests are examined. An empirical example is also included, in which we validate the pediatric fatigue short-form scale in the Patient-Reported Outcome Measurement Information System (PROMIS), compute global and local GOF statistics, and update parameters for the misfitting items.
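As a schematic of the pseudo-maximum likelihood setup (our notation, not necessarily the paper's exact expressions): with the nuisance parameters held fixed at their previously calibrated values, the focal parameters maximize the log-likelihood, and the standard pseudo-ML sandwich correction propagates first-stage uncertainty into the focal standard errors:

```latex
% Restricted recalibration as pseudo-ML (illustrative notation):
% \ell is the log-likelihood; I_{..} are information-matrix blocks;
% \hat{\nu} holds the fixed (nuisance) parameter estimates.
\[
  \hat{\beta} = \arg\max_{\beta}\, \ell(\beta, \hat{\nu}), \qquad
  \widehat{\mathrm{Var}}(\hat{\beta}) \approx
  I_{\beta\beta}^{-1}
  + I_{\beta\beta}^{-1}\, I_{\beta\nu}\,
    \widehat{\mathrm{Var}}(\hat{\nu})\,
    I_{\nu\beta}\, I_{\beta\beta}^{-1}
\]
```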
Penalized factor analysis is an efficient technique that produces a factor loading matrix with many zero elements by introducing sparsity-inducing penalties into the estimation process. However, sparse solutions and stable model selection procedures are only possible if the employed penalty is non-differentiable, which poses certain theoretical and computational challenges. This article proposes a general penalized likelihood-based estimation approach for single- and multiple-group factor analysis models. The framework builds upon differentiable approximations of non-differentiable penalties, a theoretically founded definition of degrees of freedom, and an algorithm with integrated automatic multiple tuning parameter selection that exploits second-order analytical derivative information. The proposed approach is evaluated in two simulation studies and illustrated using a real data set. All the necessary routines are integrated into the R package penfa.
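As a toy illustration of the core idea, replacing the non-differentiable lasso penalty |x| with the smooth surrogate sqrt(x^2 + eps) so that ordinary gradient-based optimization applies, here is a minimal R sketch. This is not the penfa implementation: the penalty weight gamma is fixed here, whereas penfa tunes it automatically.

```r
## Toy penalized factor analysis with a differentiable lasso surrogate:
## |x| is approximated by sqrt(x^2 + eps), so optim() can be used directly.
set.seed(1)
n <- 500; p <- 6; q <- 2
Lambda_true <- matrix(0, p, q)
Lambda_true[1:3, 1] <- c(.8, .7, .6)        # sparse true loading pattern
Lambda_true[4:6, 2] <- c(.8, .7, .6)
Sigma_true <- Lambda_true %*% t(Lambda_true) + diag(runif(p, .3, .5))
S <- cov(MASS::mvrnorm(n, rep(0, p), Sigma_true))

eps   <- 1e-4   # smoothing constant in the |x| approximation
gamma <- 0.05   # penalty strength (would normally be tuned)

negpenll <- function(par) {
  Lambda <- matrix(par[1:(p * q)], p, q)
  Psi    <- diag(exp(par[(p * q + 1):(p * q + p)]))  # log-variances > 0
  Sigma  <- Lambda %*% t(Lambda) + Psi
  ## Gaussian discrepancy (proportional to minus the log-likelihood) ...
  fit <- log(det(Sigma)) + sum(diag(solve(Sigma, S)))
  ## ... plus the smoothed lasso penalty on all loadings
  fit + gamma * sum(sqrt(Lambda^2 + eps))
}

start <- c(runif(p * q, .2, .8), rep(log(.4), p))   # asymmetric start
opt <- optim(start, negpenll, method = "BFGS", control = list(maxit = 500))
round(matrix(opt$par[1:(p * q)], p, q), 2)  # cross-loadings shrunk toward 0
```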
A first-order autoregressive growth model is proposed for longitudinal binary item analysis where responses to the same items are conditionally dependent across time given the latent traits. Specifically, the item response probability for a given item at a given time depends on the latent trait as well as the response to the same item at the previous time, or the lagged response. An initial conditions problem arises because there is no lagged response at the initial time period. We handle this problem by adapting solutions proposed for dynamic models in panel data econometrics. Asymptotic and finite sample power for the autoregressive parameters are investigated. The consequences of ignoring local dependence and the initial conditions problem are also examined for data simulated from a first-order autoregressive growth model. The proposed methods are applied to longitudinal data on Korean students’ self-esteem.
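One plausible parameterization consistent with this description (illustrative, for a 2PL-type item j, person i, and time t > 1) is:

```latex
% Lagged-response (AR(1)) item response model, illustrative form:
% gamma_j is the autoregressive (local dependence) parameter.
\[
  P\!\left(Y_{itj} = 1 \mid \theta_{it},\, y_{i,t-1,j}\right)
  = \frac{\exp\!\left(a_j \theta_{it} + b_j + \gamma_j\, y_{i,t-1,j}\right)}
         {1 + \exp\!\left(a_j \theta_{it} + b_j + \gamma_j\, y_{i,t-1,j}\right)},
  \qquad t > 1.
\]
```

At t = 1 no lagged response exists, which is precisely the initial conditions problem the paper addresses.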
The conventional setup for multi-group structural equation modeling requires a stringent condition of cross-group equality of intercepts before mean comparison with latent variables can be conducted. This article proposes a new setup that allows mean comparison without the need to estimate any mean structural model. By projecting the observed sample means onto the space of the common scores and the space orthogonal to it, the new setup allows identifying and estimating the means of the common and specific factors, although, without replicate measures, variances of specific factors cannot be distinguished from those of measurement errors. Under the new setup, cross-group mean differences of the common scores are tested independently of those of the specific factors. Such independent testing removes the conventional setup's requirement of cross-group equality of intercepts for testing cross-group equality of latent variable means via chi-square-difference statistics. The most appealing feature of the new setup is a validity index for mean differences, defined as the percentage of the sum of squared observed mean differences that is due to the mean differences of the common scores. By analyzing real data with two groups, the new setup is shown to offer more information than is obtained under the conventional setup.
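One way to formalize the decomposition (our notation; the paper's exact definition may differ in weighting): with loading matrix Lambda and observed mean difference d between the two groups, project d onto the column space of Lambda:

```latex
% Projection of the observed mean difference onto the common-score space,
% and the resulting validity index VI:
\[
  P = \Lambda(\Lambda^{\top}\Lambda)^{-1}\Lambda^{\top}, \qquad
  d = Pd + (I - P)d, \qquad
  \mathrm{VI} = \frac{\lVert Pd \rVert^{2}}{\lVert d \rVert^{2}},
\]
% so VI is the share of the squared observed mean differences attributable
% to mean differences of the common scores.
```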
Several concepts are introduced and defined: measurement invariance, structural bias, weak measurement invariance, strong factorial invariance, and strict factorial invariance. It is shown that factorial invariance has implications for (weak) measurement invariance. Definitions of fairness in employment/admissions testing and of salary equity are provided, and it is argued that strict factorial invariance is required for fairness/equity to exist. Implications for item and test bias are developed, and it is argued that item or test bias probably depends on the existence of latent variables that are irrelevant to the primary goal of test constructors.
A common assessment research design is the single-group pre-test/post-test design in which examinees are administered an assessment before instruction and then another assessment after instruction. In this type of study, the primary objective is to measure growth in examinees, individually and collectively. In an item response theory (IRT) framework, longitudinal IRT models can be used to assess growth in examinee ability over time. In a diagnostic classification model (DCM) framework, assessing growth translates to measuring changes in attribute mastery status over time, thereby providing a categorical, criterion-referenced interpretation of growth. This study introduces the Transition Diagnostic Classification Model (TDCM), which combines latent transition analysis with the log-linear cognitive diagnosis model to provide methodology for analyzing growth in a general DCM framework. Simulation study results indicate that the proposed model is flexible, provides accurate and reliable classifications, and is quite robust to violations of measurement invariance over time. The TDCM is used to analyze pre-test/post-test data from a diagnostic mathematics assessment.
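Schematically, for two time points the TDCM marginal likelihood has the latent-transition structure below (our notation; the item response probabilities follow the log-linear cognitive diagnosis model):

```latex
\[
  P(\mathbf{X}_i = \mathbf{x}_i)
  = \sum_{\boldsymbol{\alpha}_1} \sum_{\boldsymbol{\alpha}_2}
    \pi_{\boldsymbol{\alpha}_1}\,
    \tau_{\boldsymbol{\alpha}_2 \mid \boldsymbol{\alpha}_1}
    \prod_{t=1}^{2} \prod_{j}
    P\!\left(X_{itj} = x_{itj} \mid \boldsymbol{\alpha}_t\right),
\]
% with \pi the initial attribute-profile probabilities and \tau the
% profile transition probabilities that quantify growth.
```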
A multinormal partial credit model for factor analysis of polytomously scored items with ordered response categories is derived using an extension of the Dutch Identity (Holland in Psychometrika 55:5–18, 1990). In the model, latent variables are assumed to have a multivariate normal distribution conditional on unweighted sums of item scores, which are sufficient statistics. Attention is paid to maximum likelihood estimation of item parameters, multivariate moments of latent variables, and person parameters. It is shown that the maximum likelihood estimates can be found without the use of numerical integration techniques. More general models are discussed which can be used for testing the model, and it is shown how models with different numbers of latent variables can be tested against each other. In addition, multi-group extensions are proposed, which can be used for testing both measurement invariance and latent population differences. Models and procedures discussed are demonstrated in an empirical data example.
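For reference, the unidimensional partial credit model that the multinormal model extends gives, for item j with categories 0, ..., M_j (empty sums equal zero):

```latex
\[
  P(X_j = x \mid \theta)
  = \frac{\exp\!\left(x\theta - \sum_{k=1}^{x} \delta_{jk}\right)}
         {\sum_{m=0}^{M_j} \exp\!\left(m\theta - \sum_{k=1}^{m} \delta_{jk}\right)},
\]
% under which the unweighted sum score is sufficient for \theta --
% the sufficiency property that the Dutch Identity extension exploits.
```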
We present a class of finite mixture multilevel multidimensional ordinal IRT models for large-scale cross-cultural research. Our model is proposed for confirmatory research settings. Our prior for item parameters is a mixture distribution to accommodate situations where different groups of countries have different measurement operations, while countries within these groups are still allowed to be heterogeneous. A simulation study shows that all parameters can be recovered. We also apply the model to real data on the two components of affective subjective well-being: positive affect and negative affect. The psychometric behavior of these two scales is studied in 28 countries across four continents.
Establishing the invariance property of an instrument (e.g., a questionnaire or test) is a key step for establishing its measurement validity. Measurement invariance is typically assessed by differential item functioning (DIF) analysis, i.e., detecting DIF items whose response distribution depends not only on the latent trait measured by the instrument but also on the group membership. DIF analysis is confounded by the group difference in the latent trait distributions. Many DIF analyses require knowing several anchor items that are DIF-free in order to draw inferences on whether each of the rest is a DIF item, where the anchor items are used to identify the latent trait distributions. When no prior information on anchor items is available, or some anchor items are misspecified, item purification methods and regularized estimation methods can be used. The former iteratively purifies the anchor set by a stepwise model selection procedure, and the latter selects the DIF-free items by a LASSO-type regularization approach. Unfortunately, unlike the methods based on a correctly specified anchor set, these methods are not guaranteed to provide valid statistical inference (e.g., confidence intervals and p-values). In this paper, we propose a new method for DIF analysis under a multiple indicators and multiple causes (MIMIC) model for DIF. This method adopts a minimal $L_1$-norm condition for identifying the latent trait distributions. Without requiring prior knowledge about an anchor set, it can accurately estimate the DIF effects of individual items and further draw valid statistical inferences for quantifying the uncertainty. Specifically, the inference results allow us to control the type-I error for DIF detection, which may not be possible with item purification and regularized estimation methods. We conduct simulation studies to evaluate the performance of the proposed method and compare it with the anchor-set-based likelihood ratio test approach and the LASSO approach. The proposed method is applied to analysing the three personality scales of the Eysenck Personality Questionnaire-Revised (EPQ-R).
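Schematically (our notation): a shift c in the assumed group difference of the latent trait means can be absorbed into the item-level DIF effects gamma_j(c) without changing the observed-data distribution, so the indeterminacy is resolved by selecting the parameterization with minimal L1 norm,

```latex
\[
  \hat{c} = \arg\min_{c}\, \sum_{j} \left| \gamma_{j}(c) \right|,
\]
% which recovers the true parameterization when DIF items are sparse
% (roughly, when they form a minority of the items).
```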
The present research evaluates the stability of self-esteem as assessed by a daily version of the Rosenberg (Society and the adolescent self-image, Princeton University Press, Princeton, 1965) general self-esteem scale (RGSE). The scale was administered to 391 undergraduates for five consecutive days. The longitudinal data were analyzed using the integrated LC-LSTM framework, which allowed us to evaluate: (1) the measurement invariance of the RGSE, (2) its stability and change across the 5-day assessment period, (3) the amount of variance attributable to stable and transitory latent factors, and (4) the criterion-related validity of these factors. Results provided evidence for measurement invariance, mean-level stability, and rank-order stability of daily self-esteem. Latent state-trait analyses revealed that variance in scores on the RGSE can be decomposed into six components: stable self-esteem (40%), ephemeral (or temporal-state) variance (36%), stable negative method variance (9%), stable positive method variance (4%), specific variance (1%), and random error variance (10%). Moreover, latent factors associated with daily self-esteem were associated with measures of depression, implicit self-esteem, and grade point average.
Ensuring fairness in instruments like survey questionnaires or educational tests is crucial. One way to address this is by a Differential Item Functioning (DIF) analysis, which examines whether different subgroups respond differently to a particular item, controlling for their overall latent construct level. DIF analysis is typically conducted to assess measurement invariance at the item level. Traditional DIF analysis methods require knowing the comparison groups (reference and focal groups) and anchor items (a subset of DIF-free items). Such prior knowledge may not always be available, and psychometric methods have been proposed for DIF analysis when one piece of information is unknown. More specifically, when the comparison groups are unknown while anchor items are known, latent DIF analysis methods have been proposed that estimate the unknown groups by latent classes. When anchor items are unknown while comparison groups are known, methods have also been proposed, typically under a sparsity assumption that the number of DIF items is not too large. However, DIF analysis when both pieces of information are unknown has not received much attention. This paper proposes a general statistical framework under this setting. In the proposed framework, we model the unknown groups by latent classes and introduce item-specific DIF parameters to capture the DIF effects. Assuming the number of DIF items is relatively small, an $L_1$-regularised estimator is proposed to simultaneously identify the latent classes and the DIF items. A computationally efficient Expectation-Maximisation (EM) algorithm is developed to solve the non-smooth optimisation problem for the regularised estimator. The performance of the proposed method is evaluated by simulation studies and an application to item response data from a real-world educational test.
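Schematically, the regularised estimator maximizes a penalized marginal log-likelihood of the form (our notation):

```latex
\[
  \max_{\Theta}\;\; \ell(\Theta)
  \;-\; \lambda \sum_{j=1}^{J} \lVert \boldsymbol{\gamma}_{j} \rVert_{1},
\]
% where \gamma_j collects the item-specific DIF parameters across latent
% classes and \lambda is a tuning parameter; the lasso penalty drives the
% effects of DIF-free items to exactly zero, simultaneously identifying
% the latent classes and the DIF items.
```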
A variety of statistical methods have been suggested for detecting differential item functioning (DIF) in the Rasch model. Most of these methods are designed for the comparison of pre-specified focal and reference groups, such as males and females. Latent class approaches, on the other hand, allow the detection of previously unknown groups exhibiting DIF. However, this approach provides no straightforward interpretation of the groups with respect to person characteristics. Here, we propose a new method for DIF detection based on model-based recursive partitioning that can be considered a compromise between those two extremes. With this approach it is possible to detect groups of subjects exhibiting DIF, which are not pre-specified but result from combinations of observed covariates. These groups are directly interpretable and can thus help generate hypotheses about the psychological sources of DIF. The statistical background and construction of the new method are introduced by means of an instructive example, and extensive simulation studies are presented to illustrate the statistical properties of the method, which is then applied to empirical data from a general knowledge quiz. A software implementation of the method is freely available in the R system for statistical computing.
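As an illustrative sketch of DIF detection via model-based recursive partitioning: the abstract does not name the implementation, but the raschtree() function in the R package psychotree is one published implementation of this approach, used here on simulated data (all variable names are ours).

```r
## Simulate Rasch data with DIF on item 3 for respondents over age 30,
## then grow a tree that searches for covariate-based DIF groups.
library(psychotree)

set.seed(1)
n <- 400
age   <- sample(18:40, n, replace = TRUE)
theta <- rnorm(n)                          # person abilities
beta  <- seq(-1, 1, length.out = 5)        # item difficulties
probs <- plogis(outer(theta, beta, "-"))
probs[, 3] <- plogis(theta - (beta[3] + (age > 30)))  # one logit of DIF
resp <- matrix(rbinom(n * 5, 1, probs), n, 5)

d <- data.frame(age = age)
d$resp <- resp                             # matrix "column" of responses
rt <- raschtree(resp ~ age, data = d)
plot(rt)  # if DIF is detected, the tree should split near age 30 and
          # display item difficulty profiles in the terminal nodes
```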
The existence of differences in prediction systems involving test scores across demographic groups continues to be a thorny and unresolved scientific, professional, and societal concern. Our case study uses a two-stage least squares (2SLS) estimator to jointly assess measurement invariance and prediction invariance in high-stakes testing. This allowed us to examine differences across groups based on latent rather than observed scores, using data from The College Board on 176 colleges and universities. Results showed that measurement invariance was rejected for the SAT mathematics (SAT-M) subtest at the 0.01 level for 74.5% and 29.9% of cohorts in the Black versus White and Hispanic versus White comparisons, respectively. Also, on average, Black students with the same standing on the common factor had observed SAT-M scores nearly a third of a standard deviation lower than those of comparable White students. We also found evidence that group differences in SAT-M measurement intercepts may partly explain the well-known finding of observed differences in prediction intercepts. Additionally, nearly a quarter of the statistically significant observed intercept differences were no longer statistically significant at the 0.05 level once predictor measurement error was accounted for using the 2SLS procedure. Our joint measurement and prediction invariance approach based on latent scores opens the door to a new high-stakes testing research agenda whose goal is not simply to assess whether observed group-based differences exist, along with their size and direction, but to assess the causal chain starting with underlying theoretical mechanisms (e.g., contextual factors, differences in latent predictor scores) that affect the size and direction of any observed differences.
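A minimal sketch of the 2SLS idea applied here: correct a prediction slope for measurement error in the predictor by instrumenting one fallible score with a second one. The data are simulated and all names are illustrative, not the College Board data.

```r
## 2SLS correction for predictor measurement error via AER::ivreg().
library(AER)

set.seed(2)
n <- 1000
ability <- rnorm(n)                        # latent predictor
x1  <- ability + rnorm(n, sd = .6)         # observed score (e.g., SAT-M)
x2  <- ability + rnorm(n, sd = .6)         # parallel score used as instrument
gpa <- 0.5 * ability + rnorm(n, sd = .8)   # criterion

coef(lm(gpa ~ x1))          # OLS slope attenuated by measurement error
coef(ivreg(gpa ~ x1 | x2))  # 2SLS approximately recovers the 0.5 slope
```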
The issue of measurement invariance commonly arises in factor-analytic contexts, with methods for assessment including likelihood ratio tests, Lagrange multiplier tests, and Wald tests. These tests all require advance definition of the number of groups, group membership, and offending model parameters. In this paper, we study tests of measurement invariance based on stochastic processes of casewise derivatives of the likelihood function. These tests can be viewed as generalizations of the Lagrange multiplier test, and they are especially useful for: (i) identifying subgroups of individuals that violate measurement invariance along a continuous auxiliary variable without prespecified thresholds, and (ii) identifying specific parameters impacted by measurement invariance violations. The tests are presented and illustrated in detail, including an application to a study of stereotype threat and simulations examining the tests’ abilities in controlled conditions.
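The tests are built from a cumulative, decorrelated score process; schematically (our notation, following the score-based testing literature):

```latex
\[
  B(t;\, \hat{\theta})
  = \hat{I}^{-1/2}\, n^{-1/2}
    \sum_{i=1}^{\lfloor nt \rfloor} s(\hat{\theta};\, x_i),
  \qquad t \in [0, 1],
\]
% where the casewise score contributions s are ordered by the auxiliary
% variable and \hat{I} estimates the information matrix; under measurement
% invariance B behaves like a Brownian bridge, and different functionals
% of its fluctuation yield the various test statistics.
```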
Borsboom (Psychometrika, 71:425–440, 2006) noted that recent work on measurement invariance (MI) and predictive invariance (PI) has had little impact on the practice of measurement in psychology. To understand this contention, the definitions of MI and PI are reviewed, followed by results on the consistency between the two forms of invariance in the general case. The special parametric cases of factor analysis (strict factorial invariance) and linear regression analyses (strong regression invariance) are then described, along with findings on the inconsistency between the two forms of invariance in this context. Two numerical examples of inconsistency are reviewed in detail. The impact of violations of MI on accuracy of selection is illustrated. Finally, reasons for the slow dissemination of work on invariance are discussed, and the prospects for altering this situation are weighed.
Researchers are often interested in testing for measurement invariance with respect to an ordinal auxiliary variable such as age group, income class, or school grade. In a factor-analytic context, these tests are traditionally carried out via a likelihood ratio test statistic comparing a model where parameters differ across groups to a model where parameters are equal across groups. This test neglects the fact that the auxiliary variable is ordinal, and it is also known to be overly sensitive at large sample sizes. In this paper, we propose test statistics that explicitly account for the ordinality of the auxiliary variable, resulting in higher power against “monotonic” violations of measurement invariance and lower power against “non-monotonic” ones. The statistics are derived from a family of tests based on stochastic processes that have recently received attention in the psychometric literature. The statistics are illustrated via an application involving real data, and their performance is studied via simulation.
One of the most popular instruments used to assess perceived social support is the Multidimensional Scale of Perceived Social Support (MSPSS). Although the original structure of the MSPSS was defined to include three specific factors (significant others, friends, and family), studies in the literature propose different factor solutions. In this study, we addressed the controversial factor structure of the MSPSS using a meta-analytic confirmatory factor analysis approach. For this purpose, we drew on studies in the literature that examined and reported the internal structure of the MSPSS; after excluding studies that did not meet the inclusion criteria, we used summary data from 59 samples across 54 studies (total N = 27,905). We tested five models discussed in the literature and found that the fit indices of the correlated three-factor model and the bifactor model were quite good. We therefore also examined both models' factor loadings and omega coefficients. Since there was no sharp difference between the two models and the theoretical structure of the scale is represented by the three correlated factors, we concluded that the correlated three-factor model was more appropriate for the internal structure of the MSPSS. We then examined measurement invariance for this model according to language and sample type (clinical and nonclinical) and found that metric invariance was achieved. Overall, the three-factor structure of the MSPSS was supported in this study.
The two statistical approaches commonly used in the analysis of dyadic and group data, multilevel modeling and structural equation modeling, are reviewed. Next, we consider three different models for dyadic data, focusing mostly on the very popular actor–partner interdependence model (APIM). We further consider power analyses for the APIM as well as the partition of nonindependence. We then present an overview of the analysis of over-time dyadic data, considering growth-curve models, the stability-and-influence model, and the over-time APIM. After that, we turn to group data and focus on the analysis of group data using multilevel modeling, including a discussion of the social relations model, a model for dyadic data from groups of persons. The final topic concerns measurement equivalence of constructs across members of different types in dyadic and group studies.
In this chapter we review advanced psychometric methods for examining the validity of self-report measures of attitudes, beliefs, personality style, and other social psychological and personality constructs that rely on introspection. The methods include confirmatory factor analysis, to examine whether measurements can be interpreted as meaningful continua, and measurement invariance analysis, to examine whether items are answered the same way in different groups of people. We illustrate the methods using a measure of individual differences in openness to political pluralism, which includes four conceptual facets. To understand how the facets relate to the overall dimension of openness to political pluralism, we compare a second-order factor model and a bifactor model. We also examine whether the psychometric patterns of item responses are the same for males and females. These psychometric methods can both document the quality of obtained measurements and inform theorists about nuances of their constructs.
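As a hedged illustration of this model comparison in R's lavaan package, using the built-in HolzingerSwineford1939 data as a stand-in for the political pluralism items (the chapter's data; factor names below are illustrative):

```r
## Second-order vs. bifactor model, plus an invariance check across sex.
library(lavaan)

second_order <- '
  f1 =~ x1 + x2 + x3
  f2 =~ x4 + x5 + x6
  f3 =~ x7 + x8 + x9
  g  =~ f1 + f2 + f3
'
bifactor <- '
  g  =~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
  f1 =~ x1 + x2 + x3
  f2 =~ x4 + x5 + x6
  f3 =~ x7 + x8 + x9
'
fit_so <- cfa(second_order, data = HolzingerSwineford1939)
fit_bf <- cfa(bifactor, data = HolzingerSwineford1939,
              orthogonal = TRUE)  # specific factors orthogonal to g
## the second-order model is (essentially) a constrained bifactor model;
## compare via information criteria:
c(AIC(fit_so), AIC(fit_bf))

## measurement invariance across sex: equal loadings and intercepts
fit_mi <- cfa(second_order, data = HolzingerSwineford1939,
              group = "sex", group.equal = c("loadings", "intercepts"))
summary(fit_mi, fit.measures = TRUE)
```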
Few studies have examined the psychometric properties of the Connor-Davidson Resilience Scale (CD-RISC) in large adolescent community samples, and those that have report disparate findings. This study explores the psychometric properties of the CD-RISC among Spanish adolescents by means of exploratory factor analysis (EFA), Rasch analysis, and measurement invariance (MI) analysis across sex, as well as internal consistency and criterion validity. The sample comprised 463 adolescents (231 girls), aged 12 to 18 years, who completed the CD-RISC and other measures of emotional status and quality of life. The EFA suggested that the CD-RISC presented a unidimensional structure. Consequently, shorter unidimensional CD-RISC models reported in the literature were explored. Of these, the Campbell-Sills and Stein CD-RISC-10 showed the soundest psychometric properties, providing adequate item fit and supporting MI and non-differential item functioning across sex. Item difficulty levels were biased toward low levels of resilience, and some items showed malfunctioning in the lower response categories. With regard to reliability, categorical omega was .82. Strong associations with health-related quality of life, major depressive disorder symptoms, and emotional symptoms were observed, and a weak association was found between resilience and male sex. The Campbell-Sills and Stein CD-RISC-10 model emerges as the best instrument to assess resilience among Spanish adolescents, as already reported in adults. Thus, independently of developmental stage, the core of resilience may reside in aspects of hardiness and persistence.