Simple methods from introductory statistics have three important roles in regression and multilevel modeling. First, simple probability distributions are the building blocks for elaborate models. Second, multilevel models are generalizations of classical complete-pooling and no-pooling estimates, and so it is important to understand where these classical estimates come from. Third, it is often useful in practice to construct quick confidence intervals and hypothesis tests for small parts of a problem—before fitting an elaborate model, or in understanding the output from such a model.
This chapter provides a quick review of some of these methods.
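As a small illustration of such a quick calculation, here is a minimal R sketch using made-up numbers (a hypothetical survey of 700 respondents), applying the usual estimate plus or minus two standard errors:

    # hypothetical small survey: 350 of 700 respondents support a proposal
    y <- 350
    n <- 700
    p.hat <- y/n
    se.hat <- sqrt(p.hat*(1 - p.hat)/n)
    p.hat + c(-2, 2)*se.hat    # approximate 95% interval: estimate +/- 2 standard errors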
Probability distributions
A probability distribution corresponds to an urn with a potentially infinite number of balls inside. When a ball is drawn at random, the “random variable” is what is written on this ball.
Areas of application of probability distributions include:
Distributions of data (for example, heights of men, heights of women, heights of adults), for which we use the notation yi, i = 1, …, n.
Distributions of parameter values, for which we use the notation θj, j = 1, …, J, or other Greek letters such as α, β, γ. We shall see many of these with the multilevel models in Part 2 of the book. For now, consider a regression model (for example, predicting students' grades from pre-test scores) fit separately in each of several schools. The coefficients of the separate regressions can be modeled as following a distribution, which can be estimated from data.
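As a rough illustration of the second case, here is a hedged R sketch using simulated data (the schools, pre-test scores, and grades are invented), in which a regression is fit separately in each school and the collection of estimated slopes is then summarized by its mean and standard deviation:

    set.seed(1)
    J <- 8                                        # hypothetical number of schools
    schools <- data.frame(school = rep(1:J, each = 25),
                          pretest = rnorm(J*25))
    schools$grade <- 50 + 5*schools$pretest + rnorm(J*25, 0, 10)
    slopes <- sapply(1:J, function(j) {
      fit <- lm(grade ~ pretest, data = subset(schools, school == j))
      coef(fit)["pretest"]
    })
    mean(slopes); sd(slopes)                      # the distribution of slopes across schools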
We start with an overview of classical linear regression and generalized linear models, focusing on practical issues of fitting, understanding, and graphical display. We also use this as an opportunity to introduce the statistical package R.
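As a first taste of the R syntax used throughout (with simulated data standing in for a real example), a classical linear regression and a generalized linear model can be fit and summarized as follows:

    set.seed(2)
    mydata <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    mydata$y <- 1 + 2*mydata$x1 - mydata$x2 + rnorm(100)       # continuous outcome
    mydata$z <- rbinom(100, 1, plogis(mydata$x1))              # binary outcome
    fit.1 <- lm(y ~ x1 + x2, data = mydata)                    # classical linear regression
    fit.2 <- glm(z ~ x1 + x2, family = binomial(link = "logit"), data = mydata)
    summary(fit.1)
    summary(fit.2)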
Chapter 9 discussed situations in which it is dangerous to use a standard linear regression of outcome on predictors and an indicator variable for estimating causal effects: when there is imbalance or lack of complete overlap or when ignorability is in doubt. This chapter discusses these issues in more detail and provides potential solutions for each.
Imbalance and lack of complete overlap
In a study comparing two treatments (which we typically label “treatment” and “control”), causal inferences are cleanest if the units receiving the treatment are comparable to those receiving the control. Until Section 10.5, we shall restrict ourselves to ignorable models, which means that we only need to consider observed pre-treatment predictors when assessing comparability.
For ignorable models, we consider two sorts of departures from comparability—imbalance and lack of complete overlap. Imbalance occurs if the distributions of relevant pre-treatment variables differ for the treatment and control groups. Lack of complete overlap occurs if there are regions in the space of relevant pre-treatment variables where there are treated units but no controls, or controls but no treated units.
Imbalance and lack of complete overlap are issues for causal inference largely because they force us to rely more heavily on model specification and less on direct support from the data.
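As a minimal sketch of how one might check for these problems in practice (simulated data, with imbalance deliberately built in; the variable names are invented), one can compare the distributions of a pre-treatment variable across the two groups:

    set.seed(3)
    treat <- rbinom(500, 1, 0.4)
    age <- rnorm(500, mean = 40 + 5*treat, sd = 10)    # imbalance built in for illustration
    tapply(age, treat, mean)                           # differing group means indicate imbalance
    tapply(age, treat, range)                          # non-overlapping ranges indicate lack of overlap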
In this chapter we introduce the fitting of multilevel models in Bugs as run from R. Following a brief introduction to Bayesian inference in Section 16.2, we fit a varying-intercept multilevel regression, walking through each step of the model. The computations in this chapter parallel Chapter 12 on basic multilevel models. Chapter 17 presents computations for the more advanced linear and generalized linear models of Chapters 12–15.
Why you should learn Bugs
As illustrated in the preceding chapters, we can quickly and easily fit many multilevel linear and generalized linear models using the lmer() function in R. Functions such as lmer(), which use point estimates of variance parameters, are useful but can run into problems. When the number of groups is small or the multilevel model is complicated (with many varying intercepts, slopes, and non-nested components), there just might not be enough information to estimate variance parameters precisely. At that point, we can get more reasonable inferences using a Bayesian approach that averages over the uncertainty in all the parameters of the model.
We recommend the following strategy for multilevel modeling:
Start by fitting classical regressions using the lm() and glm() functions in R. Display and understand these fits as discussed in Part 1 of this book.
Set up multilevel models—that is, allow intercepts and slopes to vary, using non-nested groupings if appropriate—and fit using lmer(), displaying as discussed in most of the examples of Part 2A.
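A compressed sketch of these two steps, using simulated grouped data rather than any of the book's examples, might look as follows:

    library(lme4)                                     # provides lmer()
    set.seed(4)
    group.id <- rep(1:10, each = 20)
    a <- rnorm(10, 0, 0.5)                            # true group-level intercept shifts
    x <- rnorm(200)
    y <- 1 + a[group.id] + 0.5*x + rnorm(200)
    group <- factor(group.id)
    fit.classical <- lm(y ~ x)                        # step 1: classical regression
    fit.multilevel <- lmer(y ~ x + (1 | group))       # step 2: varying-intercept model
    summary(fit.multilevel)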
Most of this book concerns the interpretation of regression models, with the understanding that they can be fit to data fairly automatically using R and Bugs. However, it can be useful to understand some of the theory behind the model fitting, partly to connect to the usual presentation of these models in statistics and econometrics.
This chapter outlines some of the basic ideas of likelihood and Bayesian inference and computation, focusing on their application to multilevel regression. One point of this material is to connect multilevel modeling to classical regression; another is to give enough insight into the computation to allow you to understand some of the practical computational tips presented in the next chapter.
Least squares and maximum likelihood estimation
We first present the algebra for classical regression inference, which is then generalized when moving to multilevel modeling. We present the formulas here without derivation; see the references listed at the end of the chapter for more.
Least squares
The classical linear regression model is yi = Xiβ + ∊i, where y and ∊ are (column) vectors of length n, X is an n × k matrix, and β is a vector of length k. The vector β of coefficients is estimated so as to minimize the errors ∊i.
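The resulting least squares estimate has the familiar closed form β̂ = (X'X)^{-1}X'y. As a small numerical check in R (with simulated data), this matrix formula reproduces the estimate from lm():

    set.seed(5)
    n <- 100
    X <- cbind(1, rnorm(n), rnorm(n))               # n x k predictor matrix with a constant term
    beta <- c(2, -1, 0.5)
    y <- as.vector(X %*% beta + rnorm(n))
    beta.hat <- solve(t(X) %*% X, t(X) %*% y)       # (X'X)^{-1} X'y
    cbind(beta.hat, coef(lm(y ~ X - 1)))            # the two estimates agree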
Multilevel modeling is applied to logistic regression and other generalized linear models in the same way as with linear regression: the coefficients are grouped into batches and a probability distribution is assigned to each batch. Or, equivalently (as discussed in Section 12.5), error terms are added to the model corresponding to different sources of variation in the data. We shall discuss logistic regression in this chapter and other generalized linear models in the next.
State-level opinions from national polls
Dozens of national opinion polls are conducted by media organizations before every election, and it is desirable to estimate opinions at the levels of individual states as well as for the entire country. These polls are generally based on national random-digit dialing with corrections for nonresponse based on demographic factors such as sex, ethnicity, age, and education.
Here we describe a model developed for estimating state-level opinions from national polls, while simultaneously correcting for nonresponse, for any survey response of interest. The procedure has two steps: first fitting the model and then applying the model to estimate opinions by state:
We fit a regression model for the individual response y given demographics and state. This model thus estimates an average response θl for each cross-classification l of demographics and state. In our example, we have sex (male or female), ethnicity (African American or other), age (4 categories), education (4 categories), and 51 states (including the District of Columbia); thus l = 1, …, L = 3264 categories.
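A rough sketch of this first step, using lme4's glmer() rather than the Bugs model described in the chapter, and with simulated poll data in place of the real survey (the variable names are invented), might look as follows; predictions from such a fit for each demographic-by-state cell play the role of the θl above:

    library(lme4)
    set.seed(6)
    n <- 2000
    polls <- data.frame(female = rbinom(n, 1, 0.5),
                        black  = rbinom(n, 1, 0.1),
                        age    = factor(sample(1:4, n, replace = TRUE)),
                        edu    = factor(sample(1:4, n, replace = TRUE)),
                        state  = factor(sample(1:51, n, replace = TRUE)))
    polls$y <- rbinom(n, 1, plogis(-0.2 + 0.2*polls$female - 0.5*polls$black))
    fit <- glmer(y ~ female + black + (1 | age) + (1 | edu) + (1 | state),
                 family = binomial(link = "logit"), data = polls)
    summary(fit)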
Multilevel modeling is typically motivated by features in existing data or the object of study—for example, voters classified by demography and geography, students in schools, multiple measurements on individuals, and so on. Consider all the examples in Part 2 of this book. In some settings, however, multilevel data structures arise by choice from the data collection process. We briefly discuss some of these options here.
Unit sampling or cluster sampling
In a sample survey, data are collected on a set of units in order to learn about a larger population. In unit sampling, the units are selected directly from the population. In cluster sampling, the population is divided into clusters: first a sample of clusters is selected, then data are collected from each of the sampled clusters.
In one-stage cluster sampling, complete information is collected within each sampled cluster. For example, a set of classrooms is selected at random from a larger population, and then all the students within each sampled classroom are interviewed. In two-stage cluster sampling, a sample is drawn within each sampled cluster. For example, a set of classrooms is selected, and then ten students are sampled at random from each classroom and interviewed. More complicated sampling designs are possible along these lines, including adaptive designs, stratified cluster sampling, sampling with probability proportional to size, and various combinations and elaborations of these.
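As a small illustration of the two-stage case (simulated data; the population size, classroom sizes, and sample sizes are invented), the sampling process can be mimicked directly in R:

    set.seed(7)
    population <- data.frame(classroom = rep(1:100, each = 30),
                             score = rnorm(3000, mean = 60, sd = 10))
    sampled.classrooms <- sample(unique(population$classroom), 10)     # stage 1: sample clusters
    sampled.students <- do.call(rbind, lapply(sampled.classrooms, function(k) {
      in.class <- subset(population, classroom == k)
      in.class[sample(nrow(in.class), 10), ]                           # stage 2: ten students per classroom
    }))
    nrow(sampled.students)                                             # 10 classrooms x 10 students = 100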
Think of a series of models, starting with the too-simple and continuing through to the hopelessly messy. Generally it's a good idea to start simple. Or start complex if you'd like, but prepare to quickly drop things out and move to the simpler model to help understand what's going on. Working with simple models is not a research goal—in the problems we work on, we usually find complicated models more believable—but rather a technique to help understand the fitting process.
A corollary of this principle is the need to be able to fit models relatively quickly. Realistically, you don't know what model you want to be fitting, so it's rarely a good idea to run the computer overnight fitting a single model. At least, wait until you've developed some understanding by fitting many models.
Do a little work to make your computations faster and more reliable
This sounds like computational advice but is really about statistics: if you can fit models faster, you can fit more models and better understand both data and model. But getting the model to run faster often has some startup cost, either in data preparation or in model complexity.
Data subsetting
Related to the “multiple model” approach are simple approximations that speed the computations. Computers are getting faster and faster—but models are getting more and more complicated! And so these general tricks might remain important.
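One such trick is simply to work with a random subset of the data while developing the model, then refit to the full data once the specification has settled down; a minimal sketch (with simulated data standing in for a large dataset) is:

    set.seed(8)
    full.data <- data.frame(x = rnorm(100000))
    full.data$y <- 3 + 2*full.data$x + rnorm(100000)
    subset.rows <- sample(nrow(full.data), 1000)
    fit.quick <- lm(y ~ x, data = full.data[subset.rows, ])   # fast, approximate fit for exploration
    fit.full  <- lm(y ~ x, data = full.data)                  # slower fit once the model has settled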
We now discuss how to go beyond simply looking at regression coefficients, first by using simulation to summarize and propagate inferential uncertainty, and then by considering how regression can be used for causal inference.
Logistic regression is the standard way to model binary outcomes (that is, data yi that take on the values 0 or 1). Section 5.1 introduces logistic regression in a simple example with one predictor, then for most of the rest of the chapter we work through an extended example with multiple predictors and interactions.
Logistic regression with a single predictor
Example: modeling political preference given income
Conservative parties generally receive more support among voters with higher incomes. We illustrate classical logistic regression with a simple analysis of this pattern from the National Election Study in 1992. For each respondent i in this poll, we label yi = 1 if he or she preferred George Bush (the Republican candidate for president) or 0 if he or she preferred Bill Clinton (the Democratic candidate), for now excluding respondents who preferred Ross Perot or other candidates, or had no opinion. We predict preferences given the respondent's income level, which is characterized on a five-point scale.
The data are shown as (jittered) dots in Figure 5.1, along with the fitted logistic regression line, a curve that is constrained to lie between 0 and 1. We interpret the line as the probability that y = 1 given x—in mathematical notation, Pr(y = 1|x).
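The model can be fit in R with a single call to glm(); the sketch below uses simulated stand-in data with arbitrary coefficients, since the National Election Study file itself is not reproduced here:

    set.seed(9)
    income <- sample(1:5, 1000, replace = TRUE)                # five-point income scale
    vote <- rbinom(1000, 1, plogis(-1.4 + 0.33*income))        # 1 = Bush, 0 = Clinton
    fit.1 <- glm(vote ~ income, family = binomial(link = "logit"))
    summary(fit.1)                                             # Pr(vote = 1 | income) modeled as plogis(a + b*income)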
As with linear and logistic regressions, generalized linear models can be fit to multilevel structures by including coefficients for group indicators and then adding group-level models. We illustrate in this chapter with three examples from our recent applied research: an overdispersed Poisson model for police stops, a multinomial logistic model for storable voting, and an overdispersed Poisson model for social networks.
Overdispersed Poisson regression: police stops and ethnicity
We return to the New York City police example introduced in Sections 1.2 and 6.2, where we formulated the problem as an overdispersed Poisson regression, and here we generalize to a multilevel model. In order to compare ethnic groups while controlling for precinct-level variation, we perform multilevel analyses using the city's 75 precincts. Allowing precinct-level effects is consistent with theories of policing such as the “broken windows” model that emphasize local, neighborhood-level strategies. Because it is possible that the patterns are systematically different in neighborhoods with different ethnic compositions, we divide the precincts into three categories in terms of their black population: precincts that were less than 10% black, 10%–40% black, and more than 40% black. We also account for variation in stop rates between the precincts within each group. Each of the three categories represents roughly one-third of the precincts in the city, and we perform separate analyses for each set.
Overdispersion as a variance component
As discussed in Chapter 6, data that are fit by a generalized linear model are overdispersed if the data-level variance is higher than would be predicted by the model.
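One common way to express this as a variance component (not the chapter's Bugs formulation, but a rough equivalent using lme4's glmer() with simulated data) is to add an observation-level error term to the Poisson model:

    library(lme4)
    set.seed(10)
    n <- 300
    precinct <- factor(rep(1:75, each = 4))
    eth <- factor(rep(1:3, length.out = n))
    exposure <- rpois(n, 50) + 1                       # a baseline count, for example past arrests
    eps <- rnorm(n, 0, 0.5)                            # extra data-level variation (overdispersion)
    stops <- rpois(n, exp(log(exposure) - 1 + eps))
    obs <- factor(1:n)                                 # one grouping level per observation
    fit <- glmer(stops ~ eth + (1 | precinct) + (1 | obs),
                 family = poisson, offset = log(exposure))
    summary(fit)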
Whenever we represent inferences for a parameter using a point estimate and standard error, we are performing a data reduction. If the estimate is normally distributed, this summary discards no information because the normal distribution is completely defined by its mean and variance. But in other cases it can be useful to represent the uncertainty in the parameter estimation by a set of random simulations that represent possible values of the parameter vector (with more likely values being more likely to appear in the simulation). By simulation, then, we mean summarizing inferences by random numbers rather than by point estimates and standard errors.
Simulation of probability models
In this section we introduce simulation for two simple probability models. The rest of the chapter discusses how to use simulations to summarize and understand regressions and generalized linear models, and the next chapter applies simulation to model checking and validation. Simulation is important in itself and also prepares for multilevel models, which we fit using simulation-based inference, as described in Part 2B.
A simple example of discrete predictive simulation
How many girls in 400 births? The probability that a baby is a girl or boy is 48.8% or 51.2%, respectively. Suppose that 400 babies are born in a hospital in a given year. How many will be girls?
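This can be answered by direct simulation in R: each simulation draws the number of girls from a binomial distribution with 400 trials and probability 0.488, and repeating the simulation many times shows the distribution of possible outcomes.

    n.sims <- 1000
    n.girls <- rbinom(n.sims, 400, 0.488)      # 1000 simulations of the number of girls in 400 births
    hist(n.girls)
    mean(n.girls); sd(n.girls)                 # roughly 195 girls, give or take about 10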
It is not always appropriate to fit a classical linear regression model using data in their raw form. As we discuss in Sections 4.1 and 4.4, linear and logarithmic transformations can sometimes help in the interpretation of the model. Nonlinear transformations of the data are sometimes necessary to more closely satisfy additivity and linearity assumptions, which in turn should improve the fit and predictive power of the model. Section 4.5 presents some other univariate transformations that are occasionally useful. We have already discussed interactions in Section 3.3, and in Section 4.6 we consider other techniques for combining input variables.
Linear transformations
Linear transformations do not affect the fit of a classical regression model, and they do not affect predictions: the changes in the inputs and the coefficients cancel in forming the predicted value Xβ. However, a well-chosen linear transformation can improve the interpretability of coefficients and make a fitted model easier to understand. We saw in Chapter 3 how linear transformations can help with the interpretation of the intercept; this section provides examples involving the interpretation of the other coefficients in the model.
Scaling of predictors and regression coefficients. The regression coefficient βj represents the average difference in y comparing units that differ by 1 unit on the jth predictor and are otherwise identical. In some cases, though, a difference of 1 unit on the x-scale is not the most relevant comparison.
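As a small check of both points (simulated data; the height-and-earnings setup is only illustrative), rescaling a predictor changes its coefficient by the corresponding factor but leaves the fitted values unchanged:

    set.seed(11)
    height.cm <- rnorm(100, 170, 10)
    earnings <- 20000 + 300*(height.cm - 170) + rnorm(100, 0, 5000)
    fit.cm <- lm(earnings ~ height.cm)
    fit.m  <- lm(earnings ~ I(height.cm/100))          # same predictor, now in meters
    coef(fit.cm)[2]; coef(fit.m)[2]                    # second coefficient is 100 times larger
    all.equal(fitted(fit.cm), fitted(fit.m))           # identical fitted values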
Not only in the commercial world but in the realm of ideas as well, our age is holding a veritable clearance sale. Everything is had so dirt cheap that it is doubtful whether in the end anyone will bid. Every speculative score-keeper who conscientiously keeps account of the momentous march of modern philosophy, every lecturer, tutor, student, every outsider and insider in philosophy does not stop at doubting everything but goes further. Perhaps it would be inappropriate and untimely to ask them where they are actually going, but it is surely polite and modest to take it for granted that they have doubted everything, since otherwise it would certainly be peculiar to say that they went further. All of them then have made this preliminary movement, and presumably so easily that they do not find it necessary to drop a hint about how, for not even the one who anxiously and worriedly sought a little enlightenment found so much as an instructive tip or a little dietary prescription on how to conduct oneself under this enormous task. “But Descartes has done it, hasn't he?” Descartes, a venerable, humble, honest thinker whose writings surely no one can read without the deepest emotion, has done what he has said and said what he has done. Alas! Alas! Alas! That is a great rarity in our age! As he himself reiterates often enough, Descartes did not doubt with respect to faith.
Fear and Trembling, written when the author was only thirty years old, is in all likelihood Søren Kierkegaard's most-read book. This would not have surprised Kierkegaard, who wrote prophetically in his journal that “once I am dead, Fear and Trembling alone will be enough for an imperishable name as an author. Then it will [be] read, translated into foreign languages as well.” In one sense the book is not difficult to read. It is often assigned in introductory university classes, for it is the kind of book that a novice in philosophy can pick up and read with interest and profit – stimulating questions about ethics and God, faith and reason, experience and imagination. However, in another sense the book is profoundly difficult, the kind of book that can be baffling to the scholar who has read it many times and studied it for years – giving rise to a bewildering variety of conflicting interpretations.
Many of these interpretations have focused on the book's relation to Kierkegaard's own life, and in particular on the widely known story of Kierkegaard's broken engagement to Regine Olsen. There is little doubt that part of Kierkegaard's own motivation for writing Fear and Trembling was to present a disguised explanation to Regine of his true reasons for breaking off the engagement. However, it is just as certain that the philosophical importance of the book does not depend on these personal and biographical points; the book can be read and has been read with profit by those with no knowledge of Kierkegaard's own life.
On one occasion when the price of spices in Holland became somewhat slack, the merchants let a few loads be dumped at sea in order to drive up the price. This was a pardonable, perhaps a necessary stratagem. Do we need something similar in the world of spirit? Are we so sure of having attained the highest that there is nothing left to do except piously to delude ourselves that we have not come so far in order still to have something with which to fill the time? Does the present generation need such a self-deception? Should a virtuosity in this be cultivated in it, or is it not rather sufficiently perfected in the art of self-deception? Or is what it needs not rather an honest earnestness that fearlessly and incorruptibly calls attention to the tasks, an honest earnestness that lovingly preserves the tasks, that does not make people anxiously want to rush precipitously to the highest but keeps the tasks young, beautiful, delightful to look upon, and inviting to all, yet also difficult and inspiring for the noble-minded (for the noble nature is inspired only by the difficult)? Whatever one generation learns from another, no generation learns the genuinely human from a previous one. In this respect, every generation begins primitively, has no other task than each previous generation, and advances no further, provided the previous generation has not betrayed the task and deceived itself.