Choose the type of multivariable model based on the type of outcome variable you have. Perform univariate statistics to understand the distribution of your independent and outcome variables. Perform bivariate analysis of your independent variables. Run a correlation matrix to understand how your independent variables are related to one another. Assess your missing data. Perform your analysis and assess how well your model fits the data. Assess the strength of your individual covariates in estimating the outcome. Use regression diagnostics to assess the underlying assumptions of your model. Perform sensitivity analyses to assess the robustness of your findings and consider whether it would be possible to validate your model. Publish your work and soak up the glory.
In setting up your model, include, in addition to the risk factor or group assignment, those variables that have been theorized or shown in prior research to be confounders, as well as those that are empirically associated with both the risk factor and the outcome in bivariate analysis.
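As a rough illustration of several of these steps for a binary outcome, the sketch below uses a simulated dataset with hypothetical variable names (age, sex, smoker, event) and fits a logistic regression with pandas and statsmodels; it is a minimal workflow outline under those assumptions, not a complete analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "sex": rng.integers(0, 2, n),
    "smoker": rng.integers(0, 2, n),
})
logit = -6 + 0.08 * df["age"] + 0.5 * df["smoker"]
df["event"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

covariates = ["age", "sex", "smoker"]

# Univariate statistics: distributions of the independent and outcome variables
print(df[covariates + ["event"]].describe())

# Correlation matrix: how the independent variables relate to one another
print(df[covariates].corr())

# Missing-data check
print(df.isna().sum())

# Model chosen for a binary outcome: logistic regression
X = sm.add_constant(df[covariates].astype(float))
fit = sm.Logit(df["event"], X).fit()
print(fit.summary())            # covariate strength; np.exp(fit.params) gives odds ratios
```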
Exclude variables that are on the intervening pathway between the risk factor and outcome, those that are extraneous because they are not on the causal pathway, redundant variables, and variables with a lot of missing data.
Sample size calculation for multivariable analysis is complicated, but statistical programs exist to help you calculate it. Missing data on independent variables can compromise your multivariable analysis. Several methods exist to compensate for missing data on independent variables, including deleting cases, using indicator variables to flag missing values, and estimating (imputing) the values of missing cases. Methods also exist for estimating missing outcome data, using other data you have about the subject as well as multiple imputation.
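The sketch below illustrates three of these strategies for missing independent-variable data (case deletion, a missingness indicator, and estimating missing values from other variables) on a small simulated DataFrame with hypothetical columns; multiple imputation would normally be done with a dedicated routine rather than the single regression fill-in shown here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"bmi": rng.normal(27, 4, 200), "age": rng.normal(55, 9, 200)})
df.loc[rng.choice(200, 30, replace=False), "bmi"] = np.nan   # inject missingness

# 1. Delete cases (complete-case analysis)
complete_cases = df.dropna(subset=["bmi"])

# 2. Indicator variable for missingness, with a simple placeholder fill-in
df["bmi_missing"] = df["bmi"].isna().astype(int)
df["bmi_filled"] = df["bmi"].fillna(df["bmi"].mean())

# 3. Estimate (impute) missing values from other variables, here by regression on age
obs = df.dropna(subset=["bmi"])
slope, intercept = np.polyfit(obs["age"], obs["bmi"], 1)
df["bmi_imputed"] = df["bmi"].fillna(intercept + slope * df["age"])
```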
Sensitivity analysis tests how robust the results are to changes in the underlying assumptions of your analysis. In other words, if you made plausible changes in your assumptions, would you still draw the same conclusions? The changes could be a more restrictive or inclusive sample, a different way to measure your variables, a different way of handling missing data, or a change to some other feature of your analysis. With sensitivity analysis you cannot lose. If you vary the assumptions of your analysis and you get the same result, you will have more confidence in the conclusions of your study. Conversely, if plausible changes in your assumptions lead to a different conclusion, you will have learned something important. A common assumption tested in sensitivity analysis is that there are no unmeasured confounders, which can be tested with E-values or falsification analysis. Other common assumptions tested are that losses to follow-up are random, that the sample is unbiased, that the exposure and follow-up periods are correctly specified, that predictors and the outcome are measured without bias, and that the model is correctly specified.
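One concrete example of a sensitivity-analysis quantity mentioned above is the E-value of VanderWeele and Ding for unmeasured confounding; a minimal sketch, with an illustrative risk ratio, follows.

```python
import math

def e_value(rr: float) -> float:
    """Minimum strength of association an unmeasured confounder would need with
    both the exposure and the outcome to fully explain away an observed risk ratio."""
    rr = rr if rr >= 1 else 1.0 / rr          # work on the >= 1 scale
    return rr + math.sqrt(rr * (rr - 1.0))

print(e_value(1.8))   # 3.0: a fairly strong confounder would be required
```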
Modern quantitative evidence synthesis methods often combine patient-level data from different sources, known as individual participant data (IPD) sets. A specific challenge in meta-analysis of IPD sets is the presence of systematically missing data, when certain variables are not measured in some studies, and sporadically missing data, when measurements of certain variables are incomplete across different studies. Multiple imputation (MI) is among the better approaches to deal with missing data. However, MI of hierarchical data, such as IPD meta-analysis, requires advanced imputation routines that preserve the hierarchical data structure and accommodate the presence of both systematically and sporadically missing data. We have recently developed a new class of hierarchical imputation methods within the MICE framework tailored for continuous variables. This article discusses the extensions of this methodology to categorical variables, accommodating the simultaneous presence of systematically and sporadically missing data in nested designs with arbitrary missing data patterns. To address the challenge of the categorical nature of the data, we propose an accept–reject algorithm during the imputation process. Following theoretical discussions, we evaluate the performance of the new methodology through simulation studies and demonstrate its application using an IPD set from patients with kidney disease.
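As a generic illustration of the accept–reject principle named above (not the authors' specific hierarchical imputation routine), the sketch below draws a single categorical value from a target distribution using a uniform proposal.

```python
import numpy as np

def accept_reject_categorical(target_probs, rng):
    """Draw one category from `target_probs` by accept-reject,
    using a uniform proposal over the K categories."""
    target_probs = np.asarray(target_probs, dtype=float)
    p_max = target_probs.max()
    while True:
        candidate = rng.integers(len(target_probs))      # uniform proposal
        if rng.uniform() < target_probs[candidate] / p_max:
            return candidate                             # accept

rng = np.random.default_rng(2)
draws = [accept_reject_categorical([0.6, 0.3, 0.1], rng) for _ in range(10_000)]
print(np.bincount(draws) / 10_000)    # approximately [0.6, 0.3, 0.1]
```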
Workforce planning aims to model and predict supply and demand in medical specialties. In Scotland it is undertaken jointly by the Scottish Government and the Royal College of Psychiatrists in Scotland to ensure workforce sustainability. The survey described in this paper aimed to ascertain why doctors continue to choose to take a break from or delay training programmes, or to pursue alternative jobs and career pathways. Career breaks, time out of training, less than full-time working patterns, dual training and non-clinical careers need to be taken into account during workforce planning, not only to make psychiatry an attractive specialty to work in, but to ensure robust future sustainability of the psychiatric workforce in Scotland and the UK.
Raw data require a great deal of cleaning, coding, and categorizing of observations. Vague standards for this data work can make it troublingly ad hoc, with much opportunity and temptation to influence the final results. Preprocessing rules and assumptions are not often seen as part of the model, but they can influence the result just as much as control variables or functional form assumptions. In this chapter, we discuss the main data processing decisions that analysts often face and how they can affect the results: coding and classifying of variables, processing anomalous and outlier observations, and the use of sample weights.
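A minimal sketch of these three kinds of preprocessing decisions follows, on a simulated dataset with hypothetical column names; each choice (category boundaries, winsorization cut-offs, whether to weight) can shift downstream estimates.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({"income": rng.lognormal(10, 1, 1000),
                   "educ_years": rng.integers(8, 21, 1000),
                   "weight": rng.uniform(0.5, 2.0, 1000)})

# Coding / classifying: collapse years of education into categories
df["educ_cat"] = pd.cut(df["educ_years"], bins=[0, 12, 16, 30],
                        labels=["<=HS", "college", "graduate"])

# Anomalous / outlier observations: winsorize income at the 1st and 99th percentiles
lo, hi = df["income"].quantile([0.01, 0.99])
df["income_w"] = df["income"].clip(lower=lo, upper=hi)

# Sample weights: weighted and unweighted means can differ noticeably
print(df["income_w"].mean())
print(np.average(df["income_w"], weights=df["weight"]))
```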
Establishing the effectiveness of treatments for psychopathology requires accurate models of its progression over time and the factors that impact it. Longitudinal data is, however, fraught with missingness, hindering accurate modeling. We re-analyse data on schizophrenia severity in a clinical trial using hidden Markov models (HMMs). We consider missing data in HMMs with a focus on situations where data is missing not at random (MNAR) and missingness depends on the latent states, allowing symptom severity to indirectly impact the probability of missingness. In simulations, we show that including a submodel for state-dependent missingness reduces bias when data is MNAR and state-dependent, whilst not reducing accuracy when data is missing at random (MAR). When missingness depends on time, a model that allows missingness to be both state- and time-dependent is unbiased. Overall, these results show that modelling missingness as state-dependent and including relevant covariates is a useful strategy in applications of HMMs to time series with missing data. Applying the model to data from a clinical trial, we find that drop-out is more likely for patients with less severe symptoms, which may lead to a biased assessment of treatment effectiveness.
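The sketch below illustrates the central idea in simplified form, assuming a Gaussian HMM with illustrative (not fitted) parameters: when an observation is missing, the likelihood contribution comes from a state-dependent missingness probability rather than from the emission density.

```python
import numpy as np
from scipy.stats import norm

def forward_loglik(y, pi, A, means, sds, p_miss):
    """Log-likelihood of a Gaussian HMM in which each latent state k has its
    own probability p_miss[k] of producing a missing observation (np.nan)."""
    log_alpha = None
    for t, obs in enumerate(y):
        if np.isnan(obs):        # missing: only the missingness submodel contributes
            log_b = np.log(p_miss)
        else:                    # observed: emission density times P(observed | state)
            log_b = norm.logpdf(obs, means, sds) + np.log(1.0 - p_miss)
        if t == 0:
            log_alpha = np.log(pi) + log_b
        else:
            log_alpha = log_b + np.logaddexp.reduce(
                log_alpha[:, None] + np.log(A), axis=0)
    return np.logaddexp.reduce(log_alpha)

y = np.array([1.2, np.nan, 0.8, 3.1, np.nan])
print(forward_loglik(y,
                     pi=np.array([0.5, 0.5]),
                     A=np.array([[0.9, 0.1], [0.2, 0.8]]),
                     means=np.array([1.0, 3.0]),
                     sds=np.array([0.5, 0.5]),
                     p_miss=np.array([0.1, 0.4])))
```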
In the present paper, a model for describing dynamic processes is constructed by combining the common Rasch model with the concept of structurally incomplete designs. This is accomplished by mapping each item onto a collection of virtual items, one of which is assumed to be presented to the respondent depending on the preceding responses and/or the feedback obtained. It is shown that, in the case of subject control, no unique conditional maximum likelihood (CML) estimates exist, whereas marginal maximum likelihood (MML) proves a suitable estimation procedure. A hierarchical family of dynamic models is presented, and it is shown how to test special cases against more general ones. Furthermore, it is shown that the model presented is a generalization of a class of mathematical learning models known as Luce's beta-model.
Consider an old test X consisting of s sections and two new tests Y and Z similar to X consisting of p and q sections respectively. All subjects are given test X plus two variable sections from either test Y or Z. Different pairings of variable sections are given to each subsample of subjects. We present a method of estimating the covariance matrix of the combined test (X1, ..., Xs, Y1, ..., Yp, Z1, ..., Zq) and describe an application of these estimation techniques to linear, observed-score, test equating.
The posterior distribution of the bivariate correlation is analytically derived given a data set where x is completely observed but y is missing at random for a portion of the sample. Interval estimates of the correlation are then constructed from the posterior distribution in terms of highest density regions (HDRs). Various choices for the form of the prior distribution are explored. For each of these priors, the resulting Bayesian HDRs are compared with each other and with intervals derived from maximum likelihood theory.
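As a sketch of the interval-construction step only, the code below computes a highest density region (for a unimodal posterior, the shortest interval) from simulated posterior draws of a correlation; the analytical posterior derived in the paper is not reproduced here.

```python
import numpy as np

def hdr_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the sorted posterior draws."""
    s = np.sort(samples)
    n = len(s)
    k = int(np.ceil(mass * n))
    widths = s[k - 1:] - s[:n - k + 1]     # width of every window holding k draws
    i = int(np.argmin(widths))
    return s[i], s[i + k - 1]

rng = np.random.default_rng(4)
rho_draws = np.tanh(rng.normal(np.arctanh(0.5), 0.08, 20000))  # illustrative posterior draws
print(hdr_interval(rho_draws))
```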
Standard procedures for drawing inferences from complex samples do not apply when the variable of interest θ cannot be observed directly, but must be inferred from the values of secondary random variables that depend on θ stochastically. Examples are proficiency variables in item response models and class memberships in latent class models. Rubin's “multiple imputation” techniques yield approximations of sample statistics that would have been obtained, had θ been observable, and associated variance estimates that account for uncertainty due to both the sampling of respondents and the latent nature of θ. The approach is illustrated with data from the National Assessment of Educational Progress.
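The variance combination behind multiple imputation is Rubin's rule: the total variance adds the average within-imputation variance to an inflated between-imputation component. A minimal sketch with illustrative plausible-value estimates follows.

```python
import numpy as np

est = np.array([251.3, 253.1, 250.8, 252.4, 251.9])   # estimate from each of M imputations
var = np.array([4.2, 4.0, 4.3, 4.1, 4.2])             # sampling variance within each imputation

M = len(est)
q_bar = est.mean()                      # combined point estimate
W = var.mean()                          # within-imputation variance
B = est.var(ddof=1)                     # between-imputation variance
T = W + (1 + 1 / M) * B                 # total variance (Rubin's rule)
print(q_bar, np.sqrt(T))
```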
Missing data occur in many real-world studies. Knowing the type of missingness mechanism is important for adopting an appropriate statistical analysis procedure. Many statistical methods assume data are missing completely at random (MCAR) because of its simplicity. Therefore, it is necessary to test whether this assumption is satisfied before applying those procedures. In the literature, most of the procedures for testing MCAR were developed under a normality assumption, which is sometimes difficult to justify in practice. In this paper, we propose a nonparametric test of MCAR for incomplete multivariate data which does not require distributional assumptions. The proposed test is carried out by comparing the distributions of the observed data across different missing-pattern groups. We prove that the proposed test is consistent against any distributional differences in the observed data. Simulation shows that the proposed procedure has the Type I error well controlled at the nominal level for testing MCAR and also has good power against a variety of non-MCAR alternatives.
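As a simplified illustration of the underlying idea, comparing observed-data distributions across missing-data patterns, the sketch below uses a two-sample Kolmogorov–Smirnov test as a stand-in for the paper's own test statistic.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(5)
n = 1000
x = rng.normal(0, 1, n)
y = x + rng.normal(0, 1, n)
y[x > 0.5] = np.nan                      # missingness in y depends on x: not MCAR

df = pd.DataFrame({"x": x, "y": y})
pattern = df["y"].isna()                 # two missing-data patterns: y missing / y observed
stat, pval = ks_2samp(df.loc[pattern, "x"], df.loc[~pattern, "x"])
print(pval)   # small p-value: observed x differs across patterns, evidence against MCAR
```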
Mediation analysis constitutes an important part of treatment studies, as it identifies the mechanisms by which an intervention achieves its effect. The structural equation model (SEM) is a popular framework for modeling such causal relationships. However, current methods impose various restrictions on study designs and data distributions, limiting the utility of the information they provide in real study applications. In particular, in longitudinal studies missing data are commonly addressed under the assumption of missing at random (MAR), and current methods are unable to handle such missing data if parametric assumptions are violated.
In this paper, we propose a new, robust approach to address the limitations of current SEM within the context of longitudinal mediation analysis by utilizing a class of functional response models (FRM). Being distribution-free, the FRM-based approach does not impose any parametric assumption on data distributions. In addition, by extending the inverse probability weighted (IPW) estimates to the current context, the FRM-based SEM provides valid inference for longitudinal mediation analysis under the two most popular missing data mechanisms: missing completely at random (MCAR) and missing at random (MAR). We illustrate the approach with both real and simulated data.
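A minimal sketch of the inverse probability weighting ingredient under MAR follows, using simulated data with hypothetical variable names: the probability that the follow-up outcome is observed is modeled from a baseline covariate, and complete cases are weighted by the inverse of that probability.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 800
baseline = rng.normal(0, 1, n)
followup = 0.6 * baseline + rng.normal(0, 1, n)
observed = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + baseline))))   # MAR: depends on baseline
followup = np.where(observed == 1, followup, np.nan)

X = sm.add_constant(baseline)
p_obs = sm.Logit(observed, X).fit(disp=0).predict(X)              # P(observed | baseline)
w = observed / p_obs                                              # IPW weights (0 when missing)

naive = np.nanmean(followup)                                      # complete-case mean, biased here
ipw = np.nansum(w * followup) / w.sum()                           # IPW-corrected mean
print(naive, ipw)
```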
In this paper, the constrained maximum likelihood estimation of a two-level covariance structure model with unbalanced designs is considered. The two-level model is reformulated as a single-level model by treating the group-level latent random vectors as hypothetical missing data. Then, the popular EM algorithm is extended to obtain the constrained maximum likelihood estimates. For general nonlinear constraints, the multiplier method is used at the M-step to find the constrained minimum of the conditional expectation. An accelerated EM gradient procedure is derived to handle linear constraints. The empirical performance of the proposed EM-type algorithms is illustrated by some artificial and real examples.
A general approach for fitting a model to a data matrix by weighted least squares (WLS) is studied. This approach consists of iteratively performing (steps of) existing algorithms for ordinary least squares (OLS) fitting of the same model. The approach is based on minimizing a function that majorizes the WLS loss function. The generality of the approach implies that, for every model for which an OLS fitting algorithm is available, the present approach yields a WLS fitting algorithm. In the special case where the WLS weight matrix is binary, the approach reduces to missing data imputation.
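The binary-weight special case can be sketched as follows, assuming a rank-r least squares model for the data matrix: the ordinary (unweighted) fit is iterated, with missing cells re-imputed from the current fit at each step.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 8))   # true rank-3 data matrix
W = rng.uniform(size=X.shape) > 0.2                      # binary weights: True = observed
r = 3

Z = np.where(W, X, 0.0)                                  # start: missing entries set to 0
for _ in range(200):
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)     # OLS (rank-r) fit of the current matrix
    fit = (U[:, :r] * s[:r]) @ Vt[:r]
    Z = np.where(W, X, fit)                              # impute missing cells with the fit

print(np.max(np.abs((fit - X)[~W])))   # error on the missing entries shrinks as iterations proceed
```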
Situations sometimes arise in which variables collected in a study are not jointly observed. This typically occurs because of study design. An example is an equating study where distinct groups of subjects are administered different sections of a test. In the normal maximum likelihood function to estimate the covariance matrix among all variables, elements corresponding to those that are not jointly observed are unidentified. If a factor analysis model holds for the variables, however, then all sections of the matrix can be accurately estimated, using the fact that the covariances are a function of the factor loadings. Standard errors of the estimated covariances can be obtained by the delta method. In addition to estimating the covariance matrix in this design, the method can be applied to other problems such as regression factor analysis. Two examples are presented to illustrate the method.
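The identity exploited here can be sketched with a hypothetical single-factor example: under the model Σ = ΛΛ′ + Ψ, the covariance between two variables that were never administered together is still determined by the product of their loadings.

```python
import numpy as np

lam = np.array([0.8, 0.7, 0.6, 0.9])        # loadings (assumed estimated from observed sections)
psi = np.array([0.36, 0.51, 0.64, 0.19])    # unique variances

Sigma = np.outer(lam, lam) + np.diag(psi)   # model-implied covariance matrix

# Suppose variables 1 and 3 were never administered together in this hypothetical design;
# their covariance is nevertheless identified as the product of their loadings:
print(Sigma[0, 2], lam[0] * lam[2])
```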
Existing test statistics for assessing whether incomplete data represent a missing completely at random sample from a single population are based on a normal likelihood rationale and effectively test for homogeneity of means and covariances across missing data patterns. The likelihood approach cannot be implemented adequately if a pattern of missing data contains very few subjects. A generalized least squares rationale is used to develop parallel tests that are expected to be more stable in small samples. Three factors were varied for a simulation: number of variables, percent missing completely at random, and sample size. One thousand data sets were simulated for each condition. The generalized least squares test of homogeneity of means performed close to an ideal Type I error rate for most of the conditions. The generalized least squares test of homogeneity of covariance matrices and a combined test performed quite well also.
Time limits are imposed on many computer-based assessments, and it is common to observe examinees who run out of time, resulting in missingness due to not-reached items. The present study proposes an approach to account for the missingness mechanism of not-reached items via response time censoring. The censoring mechanism is directly incorporated into the observed likelihood of item responses and response times. A marginal maximum likelihood estimator is proposed, and its asymptotic properties are established. The proposed method was evaluated through simulation studies and compared to several alternative approaches that ignore the censoring. An empirical study based on the PISA 2018 Science Test was further conducted.
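As a simplified stand-in for the paper's formulation, the sketch below shows how a right-censored response-time term can enter a likelihood, assuming a lognormal response-time model with illustrative parameters: a reached item contributes its density, while a not-reached item contributes a survival probability.

```python
import numpy as np
from scipy.stats import lognorm

def rt_loglik(times, censor_time, mu, sigma):
    """times: response times in seconds, with np.nan marking a not-reached item;
    censor_time: time at which the not-reached item's response time is censored."""
    ll = 0.0
    for t in times:
        if np.isnan(t):        # not reached: right-censored term P(T > censor_time)
            ll += lognorm.logsf(censor_time, s=sigma, scale=np.exp(mu))
        else:                  # reached: full response-time density
            ll += lognorm.logpdf(t, s=sigma, scale=np.exp(mu))
    return ll

print(rt_loglik(np.array([35.0, 50.0, np.nan]), censor_time=20.0, mu=3.8, sigma=0.5))
```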
Measures of agreement are used in a wide range of behavioral, biomedical, psychosocial, and health-care related research to assess the reliability of diagnostic tests, the psychometric properties of instruments, the fidelity of psychosocial interventions, and the accuracy of proxy outcomes. The concordance correlation coefficient (CCC) is a popular measure of agreement for continuous outcomes. In modern-day applications, data are often clustered, making inference difficult to perform using existing methods. In addition, as longitudinal study designs become increasingly popular, missing data have become a serious issue, and the lack of methods to systematically address this problem has hampered the progress of research in the aforementioned fields. In this paper, we develop a novel approach to tackle the complexities involved in addressing missing data and other related issues when performing CCC analysis within a longitudinal data setting. The approach is illustrated with both real and simulated data.
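For reference, Lin's concordance correlation coefficient itself can be sketched for a single pair of raters, ignoring the clustering and missing-data machinery that the paper develops.

```python
import numpy as np

def ccc(x, y):
    """Lin's concordance correlation coefficient for two continuous ratings."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    return 2 * sxy / (x.var(ddof=1) + y.var(ddof=1) + (x.mean() - y.mean()) ** 2)

rater_a = [12.1, 15.3, 9.8, 14.0, 11.2]
rater_b = [12.5, 14.9, 10.4, 13.6, 11.0]
print(ccc(rater_a, rater_b))
```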
In knowledge space theory, existing adaptive assessment procedures can only be applied when suitable estimates of their parameters are available. In this paper, an iterative procedure is proposed which updates its parameters as the number of assessments increases. The first assessments are run using parameter values that favor accuracy over efficiency. Subsequent assessments are run using new parameter values estimated from the incomplete response patterns of previous assessments. Parameter estimation is carried out through a new probabilistic model for missing-at-random data. Two simulation studies show that, as the number of assessments increases, the performance of the proposed procedure approaches that of gold standards.