In latent space item response models (LSIRMs), persons and items are embedded in a low-dimensional Euclidean latent space. This makes it possible to reveal interactions among persons and/or items that remain unmodeled in conventional item response theory models. The current estimation approach for LSIRMs is a fully Bayesian procedure based on Markov chain Monte Carlo, which, while practical, is computationally demanding and hampers applied researchers from using the models in a wide range of settings. We therefore propose an LSIRM based on two variants of regularized joint maximum likelihood (JML) estimation: penalized JML and constrained JML. Owing to the absence of integrals in the likelihood, the JML methods allow various models to be fit in a limited amount of time. This computational speed facilitates a practical extension of LSIRMs to ordinal data and makes it possible to select the dimensionality of the latent space using cross-validation. In this study, we derive the two JML approaches and address the issues that arise when using maximum likelihood to estimate the LSIRM. We present a simulation study demonstrating acceptable parameter recovery and adequate performance of the cross-validation procedure. In addition, we estimate different binary and ordinal LSIRMs on real datasets pertaining to deductive reasoning and personality. All methods are implemented in the R package ‘LSMjml’, which is available from CRAN.
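For concreteness, a common parameterization of the binary LSIRM models the log-odds of a correct response as a person intercept plus an item intercept minus a (weighted) distance between the person's and the item's latent positions; the penalized JML objective below is a schematic sketch (the quadratic penalty on the positions is an illustrative assumption, not necessarily the authors' exact formulation):

\operatorname{logit} P(Y_{pi} = 1) = \theta_p + \beta_i - \gamma\,\lVert \mathbf{z}_p - \mathbf{w}_i \rVert, \qquad \ell_{\mathrm{pen}} = \sum_{p,i} \log P(Y_{pi} = y_{pi}) - \lambda \Big( \sum_p \lVert \mathbf{z}_p \rVert^2 + \sum_i \lVert \mathbf{w}_i \rVert^2 \Big).

The constrained variant instead maximizes the unpenalized log-likelihood subject to bounds on the position parameters; in either case no integration over a latent trait distribution is needed, which is what makes JML estimation fast.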
This chapter explores advanced applications of network machine learning for multiple networks. We introduce anomaly detection in time series of networks, identifying significant structural changes over time. The chapter then focuses on signal subnetwork estimation for network classification tasks. We present both incoherent and coherent approaches, with incoherent methods identifying edges that best differentiate between network classes, and coherent methods leveraging additional network structure to improve classification accuracy. Practical applications, such as classifying brain networks, are emphasized throughout. These techniques apply to collections of networks, providing a toolkit for analyzing and classifying complex, multinetwork datasets. By integrating previous concepts with new methodologies, we offer a framework for extracting insights and making predictions from diverse network structures with associated attributes.
What is the optimal level of questionnaire detail required to measure bilingual language experience? This empirical evaluation compares alternative measures of language exposure of varying cost (i.e., questionnaire detail) in terms of their performance as predictors of oral language outcomes. The alternative measures were derived from Q-BEx questionnaire data collected from a diverse sample of 121 heritage bilinguals (5–9 years of age) growing up in France, the Netherlands and the UK. Outcome data consisted of morphosyntax and vocabulary measures (in the societal language) and parental estimates of oral proficiency (in the heritage language). Statistical modelling exploited information-theoretic and cross-validation approaches to identify the optimal language exposure measure. Optimal cost–benefit was achieved with cumulative exposure (for the societal language) and current exposure in the home (for the heritage language). The greatest level of questionnaire detail did not yield more reliable predictors of language outcomes.
Typical Bayesian methods for models with latent variables (or random effects) involve directly sampling the latent variables along with the model parameters. In high-level software code for model definitions (using, e.g., BUGS, JAGS, Stan), the likelihood is therefore specified as conditional on the latent variables. This can lead researchers to perform model comparisons via conditional likelihoods, where the latent variables are considered model parameters. In other settings, however, typical model comparisons involve marginal likelihoods where the latent variables are integrated out. This distinction is often overlooked despite the fact that it can have a large impact on the comparisons of interest. In this paper, we clarify and illustrate these issues, focusing on the comparison of conditional and marginal Deviance Information Criteria (DICs) and Watanabe–Akaike Information Criteria (WAICs) in psychometric modeling. The conditional/marginal distinction corresponds to whether the model should be predictive for the clusters that are in the data or for new clusters (where “clusters” typically correspond to higher-level units like people or schools). Correspondingly, we show that marginal WAIC corresponds to leave-one-cluster out cross-validation, whereas conditional WAIC corresponds to leave-one-unit out. These results lead to recommendations on the general application of the criteria to models with latent variables.
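In schematic notation (the symbols are mine, not the paper's), with cluster-level latent variables \zeta_j and model parameters \theta, the distinction is between which likelihood gets plugged into DIC/WAIC:

p_{\mathrm{cond}}(\mathbf{y}_j) = p(\mathbf{y}_j \mid \theta, \zeta_j), \qquad p_{\mathrm{marg}}(\mathbf{y}_j) = \int p(\mathbf{y}_j \mid \theta, \zeta_j)\, p(\zeta_j \mid \theta)\, d\zeta_j .

WAIC built from the marginal densities assesses prediction for a new cluster j (leave-one-cluster-out), whereas WAIC built from the conditional densities assesses prediction for a new unit within an existing cluster (leave-one-unit-out).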
In item response theory (IRT), it is often necessary to perform restricted recalibration (RR) of the model: A set of (focal) parameters is estimated holding a set of (nuisance) parameters fixed. Typical applications of RR include expanding an existing item bank, linking multiple test forms, and associating constructs measured by separately calibrated tests. In the current work, we provide full statistical theory for RR of IRT models under the framework of pseudo-maximum likelihood estimation. We describe the standard error calculation for the focal parameters, the assessment of overall goodness-of-fit (GOF) of the model, and the identification of misfitting items. We report a simulation study to evaluate the performance of these methods in the scenario of adding a new item to an existing test. Parameter recovery for the focal parameters as well as Type I error and power of the proposed tests are examined. An empirical example is also included, in which we validate the pediatric fatigue short-form scale in the Patient-Reported Outcome Measurement Information System (PROMIS), compute global and local GOF statistics, and update parameters for the misfitting items.
The use of multilevel VAR(1) models to unravel within-individual process dynamics is gaining momentum in psychological research. These models accommodate the structure of intensive longitudinal datasets in which repeated measurements are nested within individuals. They estimate within-individual auto- and cross-regressive relationships while incorporating and using information about the distributions of these effects across individuals. An important quality feature of the obtained estimates pertains to how well they generalize to unseen data. Bulteel and colleagues (Psychol Methods 23(4):740–756, 2018a) showed that this feature can be assessed through a cross-validation approach, yielding a predictive accuracy measure. In this article, we follow up on their results by performing three simulation studies that allow us to systematically study five factors that likely affect the predictive accuracy of multilevel VAR(1) models: (i) the number of measurement occasions per person, (ii) the number of persons, (iii) the number of variables, (iv) the contemporaneous collinearity between the variables, and (v) the distributional shape of the individual differences in the VAR(1) parameters (i.e., normal versus multimodal distributions). Simulation results show that pooling information across individuals and using multilevel techniques prevent overfitting. Also, we show that when variables are expected to show strong contemporaneous correlations, performing multilevel VAR(1) in a reduced variable space can be useful. Furthermore, results reveal that multilevel VAR(1) models with random effects have a better predictive performance than person-specific VAR(1) models when the sample includes groups of individuals that share similar dynamics.
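In generic notation (not the authors' exact symbols), a multilevel VAR(1) model for person i at occasion t can be written as

\mathbf{y}_{it} = \mathbf{c}_i + \boldsymbol{\Phi}_i\, \mathbf{y}_{i,t-1} + \boldsymbol{\varepsilon}_{it}, \qquad \boldsymbol{\varepsilon}_{it} \sim N(\mathbf{0}, \boldsymbol{\Sigma}), \qquad \bigl(\mathbf{c}_i, \operatorname{vec}\boldsymbol{\Phi}_i\bigr) \sim N(\boldsymbol{\mu}, \boldsymbol{\Omega}),

where the diagonal of \Phi_i holds the autoregressive effects, the off-diagonal elements hold the cross-regressive effects, and the person-level distribution of (c_i, \Phi_i) is the multilevel component whose pooling across individuals guards against overfitting.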
Using Carroll's external analysis, several studies have found that unfolding models account for more, although seldom significantly more, variance in preferences than Tucker's vector model. In studies of sociometric ratings and political preferences, the unfolding model again rarely outpredicted the vector model by a significant amount. Yet on cross-validation, the unfolding model consistently accounted for more variance. Results suggest that sometimes significance tests are less sensitive than cross-validation procedures to the small but consistent superiority of the unfolding model. Future researchers may wish to use significance tests and cross-validation techniques in comparing models.
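Schematically (notation is mine), the two models predict subject i's preference for stimulus j in different functional forms: the vector model uses the projection of the stimulus onto a subject-specific direction, while the unfolding model uses closeness to a subject-specific ideal point,

\hat{s}_{ij} = \sum_k a_{ik}\, x_{jk} \quad \text{(vector model)}, \qquad \hat{s}_{ij} = c_i - \sum_k \bigl( x_{jk} - y_{ik} \bigr)^2 \quad \text{(unfolding model)},

with the stimulus coordinates x_{jk} held fixed in the external analysis and the subject parameters estimated by regression.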
A highly popular method for examining the stability of a data clustering is to split the data into two parts, cluster the observations in Part A, assign the objects in Part B to their nearest centroid in Part A, and then independently cluster the Part B objects. One then examines how close the two partitions are (say, by the Rand measure). Another proposal is to split the data into k parts, and see how their centroids cluster. By means of synthetic data analyses, we demonstrate that these approaches fail to identify the appropriate number of clusters, particularly as sample size becomes large and the variables exhibit higher correlations.
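A minimal sketch of the split-sample stability check described above, assuming k-means clustering and the adjusted Rand index as the agreement measure (the function and variable names are mine):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, pairwise_distances_argmin

def split_sample_stability(X, k, seed=0):
    # Randomly split the observations into parts A and B
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    A, B = X[idx[: len(X) // 2]], X[idx[len(X) // 2 :]]

    # Cluster part A, then assign each B object to its nearest A centroid
    km_A = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(A)
    labels_B_from_A = pairwise_distances_argmin(B, km_A.cluster_centers_)

    # Independently cluster part B
    km_B = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(B)

    # Agreement between the two partitions of B; values near 1 suggest stability
    return adjusted_rand_score(labels_B_from_A, km_B.labels_)

Repeating this over candidate values of k and picking the most stable partition is exactly the practice whose reliability the synthetic-data analyses call into question.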
It is widely believed that a joint factor analysis of item responses and response time (RT) may yield more precise ability scores than those conventionally predicted from responses alone. For this purpose, a simple-structure factor model is often preferred as it only requires specifying an additional measurement model for item-level RT while leaving the original item response theory (IRT) model for responses intact. The added speed factor indicated by item-level RT correlates with the ability factor in the IRT model, allowing RT data to carry additional information about respondents’ ability. However, parametric simple-structure factor models are often restrictive and fit poorly to empirical data, which undermines confidence in the suitability of a simple factor structure. In the present paper, we analyze the 2015 Programme for International Student Assessment mathematics data using a semiparametric simple-structure model. We conclude that a simple factor structure attains a decent fit after further parametric assumptions in the measurement model are sufficiently relaxed. Furthermore, our semiparametric model implies that the association between latent ability and speed/slowness is strong in the population, but the form of association is nonlinear. It follows that scoring based on the fitted model can substantially improve the precision of ability scores.
We first discuss a phenomenon called data mining. This can involve running multiple tests of which variables or correlations are relevant. If used improperly, data mining may be associated with scientific misconduct. Next, we discuss one way to arrive at a single final model, involving stepwise methods. We see that various stepwise methods lead to different final models. Next, we see that various configurations in test situations, here illustrated for testing for cointegration, lead to different outcomes. It may be possible to see which configurations make most sense and can be used for empirical analysis. However, we suggest that it is better to keep various models and somehow combine inferences. This is illustrated by an analysis of the losses in airline revenues in the United States owing to 9/11. We see that out of four different models, three estimate a similar loss, while the fourth model suggests only 10 percent of that figure. We argue that it is better to maintain various models, that is, models that withstand various diagnostic tests, for inference and for forecasting, and to combine what can be learned from them.
From Part III - Methodological Challenges of Experimentation in Sociology. Davide Barrera (Università degli Studi di Torino, Italy), Klarita Gërxhani (Vrije Universiteit, Amsterdam), Bernhard Kittel (Universität Wien, Austria), Luis Miller (Institute of Public Goods and Policies, Spanish National Research Council), and Tobias Wolbring (School of Business, Economics and Society, Friedrich-Alexander-University Erlangen-Nürnberg).
This chapter addresses the often-misunderstood concept of validity. Much of the methodological discussion around sociological experiments is framed in terms of internal and external validity. The standard view is that the more we ensure that the experimental treatment is isolated from potential confounds (internal validity), the more unlikely it is that the experimental results can be representative of phenomena of the outside world (external validity). However, other accounts describe internal validity as a prerequisite of external validity: Unless we ensure internal validity of an experiment, little can be said of the outside world. We contend in this chapter that problems of either external or internal validity do not necessarily depend on the artificiality of experimental settings or on the laboratory–field distinction between experimental designs. We discuss the internal–external distinction and propose instead a list of potential threats to the validity of experiments that includes "usual suspects" like selection, history, attrition, and experimenter demand effects and elaborate on how these threats can be productively handled in experimental work. Moreover, in light of the different types of experiments, we also discuss the strengths and weaknesses of each regarding threats to internal and external validity.
We can easily find ourselves with lots of predictors. This situation has been common in ecology and environmental science but has spread to other biological disciplines as genomics, proteomics, metabolomics, etc., become widespread. Models can become very complex, and with many predictors, collinearity is more likely. Fitting the models is tricky, particularly if we’re looking for the “best” model, and the way we approach the task depends on how we’ll use the model results. This chapter describes different model selection approaches for multiple regression models and discusses ways of measuring the importance of specific predictors. It covers stepwise procedures, all subsets, information criteria, model averaging and validation, and introduces regression trees, including boosted trees.
A good model aims to learn the underlying signal without overfitting (i.e. fitting to the noise in the data). This chapter has four main parts: The first part covers objective functions and errors. The second part covers various regularization techniques (weight penalty/decay, early stopping, ensemble, dropout, etc.) to prevent overfitting. The third part covers the Bayesian approach to model selection and model averaging. The fourth part covers the recent development of interpretable machine learning.
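As a generic illustration of the weight-penalty idea mentioned above (standard form, not tied to this chapter's notation), the training objective augments the error function with a term that shrinks the weights:

\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \lambda \sum_j w_j^2 ,

where larger values of the hyperparameter \lambda penalize large weights more strongly, trading a little bias for a reduction in variance; \lambda is typically tuned on validation data, much as early stopping halts training once validation error begins to rise.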
This chapter discusses the problem of selecting predictors in a linear regression model, which is a special case of model selection. One might think that the best model is the one with the most predictors. However, each predictor is associated with a parameter that must be estimated, and errors in the estimation add uncertainty to the final prediction. Thus, when deciding whether to include certain predictors or not, the associated gain in prediction skill should exceed the loss due to estimation error. Model selection is not easily addressed using a hypothesis testing framework because multiple testing is involved. Instead, the standard approach is to define a criterion for preferring one model over another. One criterion is to select the model that gives the best predictions of independent data. By independent data, we mean data that is generated independently of the sample that was used to inform the model building process. Criteria for identifying the model that gives the best predictions in independent data include Mallows’ Cp, Akaike’s Information Criterion, Bayesian Information Criterion, and cross-validated error.
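In their standard forms, the criteria mentioned above are, with \hat{L} the maximized likelihood, p the number of estimated parameters, n the sample size, \mathrm{RSS}_p the residual sum of squares of the candidate model, and \hat{\sigma}^2 an error-variance estimate from the full model:

\mathrm{AIC} = -2 \log \hat{L} + 2p, \qquad \mathrm{BIC} = -2 \log \hat{L} + p \log n, \qquad C_p = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} - n + 2p .

Each trades goodness of fit against a complexity penalty, mimicking the out-of-sample comparison that cross-validated error estimates directly.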
Given the rapid reductions in human mortality observed over recent decades and the uncertainty associated with their future evolution, there have been a large number of mortality projection models proposed by actuaries and demographers in recent years. Many of these, however, suffer from being overly complex, thereby producing spurious forecasts, particularly over long horizons and for small, noisy data sets. In this paper, we exploit statistical learning tools, namely group regularisation and cross-validation, to provide a robust framework to construct discrete-time mortality models by automatically selecting the most appropriate functions to best describe and forecast particular data sets. Most importantly, this approach produces bespoke models using a trade-off between complexity (to draw as much insight as possible from limited data sets) and parsimony (to prevent over-fitting to noise), with this trade-off designed to have specific regard to the forecasting horizon of interest. This is illustrated using both empirical data from the Human Mortality Database and simulated data, using code that has been made available within a user-friendly open-source R package StMoMo.
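The group regularisation referred to above is of the group-lasso type; a schematic objective (the paper's exact formulation may differ) is

\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \Big\{ -\ell(\boldsymbol{\beta}) + \lambda \sum_{g} \sqrt{p_g}\; \lVert \boldsymbol{\beta}_g \rVert_2 \Big\},

where each group g collects the p_g coefficients of one candidate age/period/cohort term, so the penalty removes whole terms rather than individual coefficients, and \lambda is chosen by cross-validation targeted at the forecasting horizon of interest.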
With rapid developments in hardware storage, precision instrument manufacturing, economic globalization, and related areas, data in various forms have become ubiquitous in human life. This enormous amount of data can be a double-edged sword. While it provides the possibility of modeling the world with higher fidelity and greater flexibility, improper modeling choices can lead to false discoveries, misleading conclusions, and poor predictions. Typical data-mining, machine-learning, and statistical-inference procedures learn from and make predictions on data by fitting parametric or non-parametric models. However, there exists no model that is universally suitable for all datasets and goals. Therefore, a crucial step in data analysis is to consider a set of postulated candidate models and learning methods (the model class) and select the most appropriate one. We provide integrated discussions on the fundamental limits of inference and prediction based on model-selection principles from modern data analysis. In particular, we introduce two recent advances of model-selection approaches, one concerning a new information criterion and the other concerning modeling procedure selection.
Classification of wines with a large number of correlated covariates may lead to classification results that are difficult to interpret. In this study, we use a publicly available dataset on wines from three known cultivars, where there are 13 highly correlated variables measuring chemical compounds of wines. The goal is to produce an efficient classifier with straightforward interpretation to shed light on the important features of wines in the classification. To achieve the goal, we incorporate principal component analysis (PCA) in the k-nearest neighbor (kNN) classification to deal with the serious multicollinearity among the explanatory variables. PCA can identify the underlying dominant features and provide a more succinct and straightforward summary over the correlated covariates. The study shows that kNN combined with PCA yields a much simpler and interpretable classifier that has comparable performance with kNN based on all the 13 variables. The appropriate number of principal components is chosen to strike a balance between predictive accuracy and simplicity of interpretation. Our final classifier is based on only two principal components, which can be interpreted as the strength of taste and level of alcohol and fermentation in wines, respectively. (JEL Classifications: C10, C14, D83)
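A minimal sketch of a PCA-plus-kNN classifier of this kind, using the familiar 13-variable, three-cultivar wine data bundled with scikit-learn (assumed here to match the study's dataset; preprocessing details such as standardization are illustrative, not necessarily the authors' exact pipeline):

from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Standardize, project onto two principal components, then classify by kNN
clf = make_pipeline(StandardScaler(), PCA(n_components=2),
                    KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(clf, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f}")

Sweeping n_components and n_neighbors over a grid under cross-validation is the natural way to strike the accuracy-versus-interpretability balance the abstract describes.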
Conveyor belt wear is an important consideration in the bulk materials handling industry. We define four belt wear rate metrics and develop a model to predict wear rates of new conveyor configurations using an industry dataset that includes ultrasonic thickness measurements, conveyor attributes, and conveyor throughput. All variables are expected to contribute in some way to explaining wear rate and are included in modeling. One specific metric, the maximum throughput-based wear rate, is selected as the prediction target, and cross-validation is used to evaluate the out-of-sample performance of random forest and linear regression algorithms. The random forest approach achieves a lower error of 0.152 mm/megatons (standard deviation [SD] = 0.0648). Permutation importance and partial dependence plots are computed to provide insights into the relationship between conveyor parameters and wear rate. This work demonstrates how belt wear rate can be quantified from imprecise thickness testing methods and provides a transparent modeling framework applicable to other supervised learning problems in risk and reliability.
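A hedged sketch of the cross-validated model comparison described above; the file name, column names, and error metric are placeholders, not the study's actual pipeline:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical table of numeric conveyor attributes, throughput, and wear rates
df = pd.read_csv("belt_wear.csv")
X = df.drop(columns=["max_wear_rate"])   # categorical attributes would need encoding first
y = df["max_wear_rate"]                  # maximum throughput-based wear rate

for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    print(type(model).__name__, round(mae, 3))

Permutation importance (sklearn.inspection.permutation_importance) and partial dependence plots can then be computed on the better-performing model to recover the kind of interpretive insights the abstract mentions.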
This chapter focuses on model evaluation and selection in Hierarchical Modelling of Species Communities (HMSC). It starts by noting that even if there are automated procedures for model selection, the most important step is actually done by the ecologist when deciding what kind of models will be fitted. The chapter then discusses different ways of measuring model fit based on contrasting the model predictions with the observed data, as well as the use of information criteria as a method for evaluating model fit. The chapter first discusses general methods that can be used to compare models that differ either in their predictors or in their structure, e.g. models with different sets of environmental covariates, models with and without spatial random effects, models with and without traits or phylogenetic information or models that differ in their prior distributions. The chapter then presents specific methods for variable selection, aimed at comparing models that are structurally identical but differ in the included environmental covariates: variable selection by the spike and slab prior approach, and reduced rank regression that aims at combining predictors to reduce their dimensionality.
Environmental factors such as sunshine hours, temperature and UV radiation (UVR) are known to influence seasonal fluctuations in vitamin D concentrations. However, currently there is poor understanding regarding the environmental factors or individual characteristics that best predict neonatal 25-hydroxyvitamin D (25(OH)D) concentrations. The aims of this study were to (1) identify environmental and individual determinants of 25(OH)D concentrations in newborns and (2) investigate whether environmental factors and individual characteristics could be used as proxy measures for neonatal 25(OH)D concentrations. 25-Hydroxyvitamin D3 (25(OH)D3) was measured from neonatal dried blood spots (DBS) of 1182 individuals born between 1993 and 2002. Monthly aggregated data on daily number of sunshine hours, temperature and UVR, available from 1993, were retrieved from the Danish Meteorological Institute. The individual predictors were obtained from the Danish National Birth register, and Statistics Denmark. The optimal model to predict 25(OH)D3 concentrations from neonatal DBS was the one including the following variables: UVR, temperature, maternal education, maternal smoking during pregnancy, gestational age at birth and parity. This model explained 30 % of the variation of 25(OH)D3 in the neonatal DBS. Ambient UVR in the month before the birth month was the best single-item predictor of neonatal 25(OH)D3, accounting for 24 % of its variance. Although this prediction model cannot substitute for actual blood measurements, it might prove useful in cohort studies ranking individuals in groups according to 25(OH)D3 status.