Designed for the applied practitioner, this book is a compact, entry-level guide to modeling and analyzing non-Gaussian and correlated data. Many practitioners work with data that fail the assumptions of the common linear regression models, necessitating more advanced modeling techniques. This Handbook presents clearly explained modeling options for such situations, along with extensive example data analyses. The book explains core models such as logistic regression, count regression, longitudinal regression, survival analysis, and structural equation modeling without relying on mathematical derivations. All data analyses are performed on real and publicly available data sets, which are revisited multiple times to show differing results using various modeling options. Common pitfalls, data issues, and interpretation of model results are also addressed. Programs in both R and SAS are made available for all results presented in the text so that readers can emulate and adapt analyses for their own data analysis needs. Data, R, and SAS scripts can be found online at http://www.spesi.org.
In this chapter we discuss the use of normal multiple linear regression models. The models are said to be normal because we assume the errors (and consequently the responses) are normally distributed. The term multiple refers to using more than one predictor to account for the variation in the response variable. We say the models are linear in the parameters, as described in Chapter 1: the prediction functions do not need to be linear in the predictors, but the regression coefficients cannot be involved in any nonlinear expressions. Regression refers to the estimation method, which regresses a data-fitted line toward the true line, a line that is rarely achieved or known.
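As a rough illustration of the model in R (the handbook's own programs are in R and SAS, but the data frame dat and the variables y, x1, x2, and x3 below are simulated stand-ins, not the handbook's data sets):

# Simulate a small data set with three predictors and a normal error term
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
dat$y <- 2 + 1.5 * dat$x1 - 0.8 * dat$x2 + 0.3 * dat$x3 + rnorm(100)

# Fit the normal multiple linear regression and inspect the results
fit <- lm(y ~ x1 + x2 + x3, data = dat)
summary(fit)                    # coefficient estimates, t-tests, R-squared
confint(fit)                    # confidence intervals for the regression coefficients
plot(fitted(fit), resid(fit))   # residuals vs fitted values as a basic assumption check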
Recall that a model is useful only if the data and variables are appropriate to the subject matter. When this is the case, we may construct a model in which we expect the predictors to influence changes in the response.
Multiple predictors are important for describing changes in the response variable for two primary reasons: the processes being modeled are rarely simple, and multiple predictors usually partition the variance exhibited in the response more thoroughly. Pairing individual predictors with the response to assess pairwise correlations averages out variation due to other, omitted predictors, leaving a large amount of unexplained variance. Often the level of a response is better understood when two or more predictors interact: the level of one predictor may alter the correlative relationship of another predictor with the response. Another reason for including multiple predictors in a predictive regression model is that such a model often accommodates new data better.
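A brief sketch of an interaction in R, again on simulated stand-in data rather than the handbook's data sets; the formula x1 * x2 expands to the main effects plus the product term x1:x2:

# Simulate data in which the effect of x1 on y depends on the level of x2
set.seed(2)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- 1 + dat$x1 + 0.5 * dat$x2 + 0.7 * dat$x1 * dat$x2 + rnorm(200)

fit_main <- lm(y ~ x1 + x2, data = dat)   # additive effects only
fit_int  <- lm(y ~ x1 * x2, data = dat)   # x1 + x2 + x1:x2
anova(fit_main, fit_int)                  # F-test for the added interaction term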
We examine the ability of normal multiple linear regression to produce viable models on each of the four data sets presented in Chapter 1. We subject each data set to normal multiple linear regression for two reasons: first, we have assumed that users of this handbook are familiar with basic statistics and normal multiple linear regression; and second, we use these analyses as a baseline, or reference, against which to compare the outcomes of the models presented in Chapter 2. Those outcomes are presented in the chapters following this one.
Given a data set or a design for collecting data, it is the task of the data analyst to match the data to an appropriate model. The selection of an appropriate model depends on a number of factors, including the goals or intentions of the study, the properties of the data collected, and the nature of the conclusions the analyst would like to make. In most cases several models can be deemed appropriate for a single data set, and the analyst must select one or more of them to address the goals of the study. The data analyst cannot focus on “right” or “wrong” models, but must instead balance the relative strengths and weaknesses of different modeling choices. The analyst must also consider the availability of computing resources, the interpretability of results, and the abilities of the analyst herself.
Very generally, the data analyst must consider the specific goals or questions that need to be addressed by the study, including whether there is an interest in evaluating model effects, in making predictions using the model, or both. The analyst must also consider the nature of the predictors for the analysis, including whether the predictors are continuous or categorical, whether interactions between predictors should be considered, whether any predictors should be treated as a source of random variation, whether any predictors are time-dependent, and so on. Perhaps most relevant to the discussions in this handbook, the data analyst must consider the nature of the data collected, including whether the outcome of interest is continuous, skewed, categorical, longitudinal, or otherwise. Figure 10.1 shows some very general properties of the response of interest that must be determined by the data analyst when matching data to an appropriate model. First, the analyst must determine the type of data representing the outcome of interest, represented by the top node, “outcome variable.” Exploratory data analyses corroborate the choice among the three options for the outcome variable given in the second tier of nodes.
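As one hedged sketch of such exploratory checks in R, using a hypothetical data frame dat with a single column outcome (simulated here as a count purely for illustration):

# Hypothetical outcome variable; a count is used here only for illustration
set.seed(3)
dat <- data.frame(outcome = rpois(100, lambda = 3))

str(dat$outcome)        # storage type: numeric, integer, factor, ...
summary(dat$outcome)    # range and central tendency
if (is.numeric(dat$outcome)) {
  hist(dat$outcome)     # symmetric, skewed, or discrete counts?
} else {
  table(dat$outcome)    # categorical: how many levels, and how balanced?
}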
This handbook is designed to provide an accessible introduction to statistical modeling techniques appropriate for data that are non-Gaussian (not normally distributed), that contain observations that are not independent of one another, or that may not be linearly related to the selected predictors. The discussion relies heavily on data examples and includes thorough explorations of data sets, model construction and evaluation, detailed interpretations of model results, and model-based predictions. We intend to provide readers with a sufficiently thorough and understandable analysis process that the techniques covered in this text can be readily applied to any similar data situation. However, it is important to understand that we use specific data sets with the various models strictly for demonstrative purposes. The outcomes we present should not be taken as definitive representations of the information contained within the data sets.
Throughout the text, we use four data sets (each described in this chapter) to exhibit the analytical methods, including exploration of the data, building appropriate models (Chapter 2), evaluating the appropriateness of the models, interpreting output, and making predictions with the models. The purpose of using the same data sets throughout is to show that multiple methods can be applied to similar or identical variables of interest, possibly resulting in different conclusions. Consistent use of the same data sets should also maintain familiarity with the data. After reading this first chapter, readers should understand the intention of every data analysis in the remainder of the text. The modeling methods applied to the data sets are models for responses with constant variance (Chapter 3), responses with nonconstant variance (Chapter 4), discrete categorical responses (Chapter 5), count responses (Chapter 6), responses that are time-dependent (time-to-event data in Chapter 7, and outcomes collected over time in Chapter 8), and models for variables that cannot be measured directly but are represented by variables that can be measured (Chapter 9). The last chapter, Chapter 10, is a guide to matching data sets to model types.
Time-to-event (TTE) data are used to describe the probability of an event occurring by some specified time. This differs from logistic regression, in which the binary response analysis focuses on the probability of the event occurring at any time during the study. Commonly studied TTE data include the probability of patients in a cohort being diagnosed with a hospital-related infection following surgery, or the probability of automobiles failing a roadworthiness test. Directly related to TTE analysis is survival analysis, which models the proportion of subjects or units surviving past a specified time. Two examples of survival analysis are the proportion of patients surviving a medical procedure, and the proportion of relay switches still functioning after an electrical current surge on a power grid. These four examples are events that depend on time. An example of an event that is not time-dependent but is still a function of an ordered sequence is the number of meters of mass-storage magnetic tape that pass over the read/write head of a tape drive before a fatal fault is detected. In each of these examples, an event of interest is a function of an ordered sequence, either time or distance.
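One way to sketch this in R is with the survival package's Kaplan-Meier estimator on simulated right-censored data; the variable names below are illustrative and do not come from the handbook's data sets:

library(survival)

# Simulate event times and independent censoring times
set.seed(4)
event_time  <- rexp(100, rate = 0.10)
censor_time <- rexp(100, rate = 0.05)
obs_time    <- pmin(event_time, censor_time)          # what is actually recorded
status      <- as.integer(event_time <= censor_time)  # 1 = event observed, 0 = censored

# Kaplan-Meier estimate of the survival function
km <- survfit(Surv(obs_time, status) ~ 1)
summary(km, times = c(5, 10))   # estimated survival (and hence event) probability by given times
plot(km, xlab = "Time", ylab = "Estimated survival probability")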
Because TTE data depend on an ordered sequence, it is critical to identify disturbances in the sequencing of the events. A school system wishes to administer a curriculum designed to prepare qualified students to pass a sequence of calculus exams, but several of the selected students have already passed at least the first of the exams before the curriculum is administered. A patient in a drug protocol study enters the study, experiences the desired response to the drug, then exits the study before the study ends. A machine part fails between two scheduled, routine inspections, so the exact time of failure is indeterminate. A patient enters a study but withdraws before either the event of interest or the termination of the study. Finally, a student has no absences during a semester that is being used to measure the time to a first absence. Each of these examples of disturbance relative to a TTE sequence is a form of censored data.
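Assuming the survival package's conventions, the common censoring patterns above might be encoded along these lines (values are illustrative only):

library(survival)

# Right censoring: the second subject withdraws at t = 8 before the event occurs
Surv(time = c(5, 8), event = c(1, 0))

# Interval censoring: a part works at inspection t = 10 and is found failed at t = 14,
# so the failure time is only known to lie between the two inspections
Surv(time = 10, time2 = 14, type = "interval2")

# Delayed entry (start, stop, event): a subject enters at t = 2 and has the event at t = 9
Surv(time = 2, time2 = 9, event = 1)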
One of the assumptions of the normal regression model is the homogeneity of variance. That is, it is assumed that each response represents an outcome from a (normal) distribution with the same, constant variation, regardless of the values of associated predictors. However, this property of homoscedasticity is often violated by responses that show fluctuations in variation across values of predictors as was seen in Chapter 3 using the four data sets. In such cases researchers will still be interested in evaluating all of the same questions that can be answered using normal regression models, but must turn to alternative modeling techniques.
For example, education researchers may collect records of standardized test scores for students in multiple classrooms with an interest in comparing performance across different student demographics. In analyzing the data, however, it may be seen that students in classes with more experienced instructors show more consistency in their performance, and therefore lower variation in scores, than students in classes with less experienced instructors. As another example, health policy researchers might be interested in connections between family income level and health and medical expenditures. When data are collected, data analysts may see that health spending varies greatly for families with significant financial resources, but is consistent for families with fewer financial resources. In both examples the variability of the outcome of interest changes depending on subject characteristics.
Within any normal regression model, inferential methods such as t-tests and F-tests are constructed under the assumption of homoscedasticity. Using this assumption, “pooled” estimates of variation are made using the data, including the random unexplained variation captured by the mean-squared error (MSE). If such inferential methods are applied to heteroscedastic data, the pooled estimates such as MSE can either underestimate or overestimate the true variation in the response, and therefore hypothesis-testing procedures will be unreliable, often in unpredictable ways. Therefore it is important for data analysts to be able to detect and correct for violations of the constant-variance assumption.
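A hedged sketch of detecting, and partially working around, nonconstant variance in R, using simulated data and the lmtest and sandwich packages (one illustrative choice among several; the handbook's own treatment of nonconstant variance is the subject of Chapter 4):

library(lmtest)    # bptest(), coeftest()
library(sandwich)  # vcovHC() heteroscedasticity-consistent covariance estimators

# Simulate data whose error spread grows with the predictor
set.seed(5)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = 0.5 * x)
fit <- lm(y ~ x)

plot(fitted(fit), resid(fit))   # a funnel shape suggests nonconstant variance
bptest(fit)                     # Breusch-Pagan test of the constant-variance assumption

# Inference that does not rely on a single pooled MSE: robust (sandwich) standard errors
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))

Weighted least squares is another common correction when the variance structure can be modeled directly.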
Modern society is data driven. When you buy – or even shop for – a shirt on the Internet, the next time you go online you'll be inundated with advertisements for more shirts, all the outcome of data collection, analysis, and targeted marketing. Global networks have been designed specifically to deliver stock market and commodities market data for near real-time trading. Public services depend heavily on censuses for allocation of government funding and assistance programs to the populations that need them. These same censuses determine the districts needed for so-called enfranchisement, at least in the United States. Travel, particularly international travel, is regulated based on personal information collected by government agencies. Large chain retailers collect cash-out data to stock stores according to collective shopping habits. Educators undertake quantitative assessments of new instructional methods to determine best practice. Health policy administrators analyze data to allocate resources according to the timing and volume of patient needs. These applications are just a hint of the universal use of data in both public and private spheres.
The ubiquity of data-driven decisions means that our personal and collective lives are affected daily by how data are analyzed and interpreted. When data are interpreted accurately, we expect fair treatment. When data are improperly collected, analyzed, or interpreted, not only is our quality of life diminished, but the faulty information can debilitate or even kill. Clearly, then, we want data analysts who, conscious of the consequences of poor or incorrect analyses, have the knowledge to extract information from data – properly and with a healthy awareness of any uncertainties that should qualify interpretation.
To support this kind of mastery, we have written this handbook to overcome two common limitations in tutorial resources for practicing data analysts.
• We make a broad selection of the most useful basic models, from a range of disciplines and domains. Applied disciplines that use statistical analysis sometimes rely on a restricted set of tools particular to the discipline. Although this practice has advantages at the entry level, it can encourage overreliance on familiar methods to the exclusion of viable, even superior, alternatives.
Structural equation modeling (SEM) may be considered a family of models for covarying variables. The models organize variables by type, either manifest (observable and measurable variables) or latent (variables that are not directly measurable), that depend on the covariance of other manifest or latent variables. This includes the familiar regression models of manifest variables dependent on covarying manifest predictor variables. An extension of regression models allows for the analysis of effects of latent variables, underlying manifest predictor variables, that influence a manifest response variable. Latent predictor variables characterizing a response lead to models in which one or more latent variables influence the covariance behavior of a set of manifest variables. Finally, SEM includes models of latent variables influencing not only manifest variables but also other latent variables. SEM, then, is a set of models designed to manage a variety of relationships among both observable and unobservable variables.
Prior to using SEM, analysts and researchers must formulate a hypothesis relating the variables of interest. That is, SEM requires subject matter expertise to devise meaningful variable covariance structures. The outcome of an SEM analysis is an assessment of how well the chosen data set substantiates the proposed hypotheses. The validity of SEM lies in the unification of the analysis and the subject matter.
Intelligence quotient measures, motivation level identification, and the properties of ancient crater-producing impactors are examples of problems that can be examined with SEM. Each of these examples of latent behavior or resurfacing agents requires the quantification of proxy or aftermath observable measurements, and each is an investigation of proposed or hypothesized relationships among a set of concomitant variables. Intelligence quotients often cannot be measured directly, so other related variables such as quantitative reasoning, reading and writing ability, short- and long-term memory, and visual processing are measured and hypothesized to be outcomes of the level of intelligence. Purchasing motivation may be hypothesized to influence internet link paths among websites, including the website link sequence and the website dwell time. Scientists hypothesize, and have sample evidence, that asteroids and comets caused the impact craters on the Moon. Variables that can be measured to associate the observed impact crater characteristics with the latent, unobserved characteristics of the causal impactors include the observed orbiting asteroids and comets, existing crater diameters, rim-to-floor depths, and ejecta spatial patterns.
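The intelligence example might be sketched as a one-factor measurement model using the lavaan package in R; the indicator names and simulated scores below are hypothetical stand-ins, not the handbook's variables:

library(lavaan)

# Simulate four observed proxies driven by one latent factor
set.seed(6)
f <- rnorm(300)
scores <- data.frame(
  reasoning = 0.8 * f + rnorm(300, sd = 0.6),
  reading   = 0.7 * f + rnorm(300, sd = 0.7),
  memory    = 0.6 * f + rnorm(300, sd = 0.8),
  visual    = 0.5 * f + rnorm(300, sd = 0.9)
)

# Latent intelligence measured by the four manifest indicators
model <- 'intelligence =~ reasoning + reading + memory + visual'
fit <- cfa(model, data = scores)
summary(fit, fit.measures = TRUE, standardized = TRUE)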