Introduction to Probability and Statistics for Data Science provides a solid course in the fundamental concepts, methods and theory of statistics for students in statistics, data science, biostatistics, engineering, and physical science programs. It teaches students to understand, use, and build on modern statistical techniques for complex problems. The authors develop the methods from both an intuitive and mathematical angle, illustrating with simple examples how and why the methods work. More complicated examples, many of which incorporate data and code in R, show how the methods are used in practice. Through this guidance, students get the big picture of how statistics works and can be applied. The text emphasizes modern topics such as regression trees, large-scale hypothesis testing, bootstrapping, MCMC, and time series, while spending less time on theoretical topics such as the Cramér-Rao lower bound and the Rao-Blackwell theorem. It features more than 250 high-quality figures, 180 of which involve actual data. Data and R code are available on the book's website so that students can reproduce the examples and do hands-on exercises.
In Chapter 3 we learned how to do basic probability calculations and even put them to use in solving some fairly complicated probability problems. In this chapter and the next two, we generalize those calculations, transitioning from working with sets and events to working with random variables.
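As a preview of the random-variable viewpoint, here is a minimal R sketch (R being the book's language; the variable names and sample size are our own) that treats X as the number of heads in three flips of a fair coin and approximates its distribution by simulation:

```r
# X = number of heads in 3 flips of a fair coin
set.seed(1)
flips <- replicate(10000, sum(sample(c(0, 1), size = 3, replace = TRUE)))
table(flips) / 10000          # approximates P(X = 0), ..., P(X = 3)
choose(3, 0:3) / 2^3          # exact probabilities for comparison: 1/8, 3/8, 3/8, 1/8
```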
To do statistics you must first be able to “speak probability.” In this chapter we are going to concentrate on the basic ideas of probability. In probability, the mechanism that generates outcomes is assumed known and the problems focus on calculating the chance of observing particular types or sets of outcomes. Classical problems include flipping “fair” coins (where fair means that on one flip of the coin the chance it comes up heads is equal to the chance it comes up tails) and “fair” dice (where fair now means the chance of landing on any side of the die is equal to that of landing on any other side).
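To make "fair" concrete, the following R sketch (our own illustration; the number of rolls is an arbitrary choice) simulates rolls of a fair six-sided die and checks that each face occurs with relative frequency near 1/6:

```r
set.seed(42)
rolls <- sample(1:6, size = 60000, replace = TRUE)  # fair die: all sides equally likely
table(rolls) / length(rolls)                        # each relative frequency near 1/6
```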
In Chapter 5 we learned about a number of discrete distributions. In this chapter we focus on continuous distributions, which are useful as models of various real-world processes and phenomena. By the end of this chapter you will know nine continuous and eight discrete distributions. There are many more continuous distributions, but these nine will suffice for our purposes.
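R provides a d/p/q/r family of functions for each standard distribution. As a brief illustration using the normal distribution (one of the nine covered here):

```r
dnorm(0)       # density at 0, equal to 1/sqrt(2*pi)
pnorm(1.96)    # CDF: P(Z <= 1.96), about 0.975
qnorm(0.975)   # quantile: about 1.96
set.seed(7)
rnorm(3)       # three random draws from the standard normal
# For a continuous distribution, probabilities are areas under the density:
integrate(dnorm, lower = -Inf, upper = 1.96)$value  # matches pnorm(1.96)
```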
Sampling joke: “If you don’t believe in random sampling, the next time you have a blood test, tell the doctor to take it all.” At the beginning of Chapter 7 we introduced the ideas of population vs. sample and parameter vs. statistic. We build on this in the current chapter. The key concept in this chapter is that if we were to take different samples from a distribution and compute some statistic, such as the sample mean, then we would get different results.
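A minimal simulation in R (our own illustration, not an example from the text) makes this concrete: repeated samples from the same distribution give different sample means, and their spread shrinks like sigma/sqrt(n):

```r
set.seed(123)
n <- 25
xbars <- replicate(5000, mean(rnorm(n, mean = 10, sd = 2)))  # 5000 sample means
mean(xbars)   # close to the population mean, 10
sd(xbars)     # close to sigma / sqrt(n) = 2 / sqrt(25) = 0.4
```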
The last two chapters have covered the basic concepts of estimation. In Chapter 9 we studied the problem of giving a single number to estimate a parameter. In Chapter 10 we looked at ways to give an interval that we believe will include the true parameter. In many applications, we want to ask some very specific questions about the parameter(s).
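In R, a point estimate and an interval estimate can be read off together. A minimal sketch on simulated data (the population mean and standard deviation are our own choices):

```r
set.seed(99)
x <- rnorm(30, mean = 5, sd = 1.5)  # a sample of size 30
mean(x)                             # point estimate of the population mean
t.test(x)$conf.int                  # 95% confidence interval for the mean
```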
We begin this chapter with a review of hypothesis testing from Chapter 12. A hypothesis is a statement about one or more parameters of a model. The null hypothesis is usually a specific statement that encapsulates "no effect." For example, if we apply one of two treatments, A or B, to volunteers, we may be interested in testing whether the population mean outcomes are equal.
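For the treatment example, a two-sample t test of the null hypothesis that the two population means are equal might look like this in R (the data below are simulated stand-ins, not real trial results):

```r
set.seed(2024)
a <- rnorm(20, mean = 50, sd = 8)  # outcomes under treatment A
b <- rnorm(20, mean = 55, sd = 8)  # outcomes under treatment B
t.test(a, b)  # tests H0: population means equal; reports t, df, and p-value
```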
Up to this point we have been talking about what are often called frequentist methods, because the properties of a statistical method are judged by its long-run relative frequency behavior. With this approach, the probability of an event is defined as the proportion of times the event occurs in the long run. Parameters, that is, values that characterize a distribution, such as the mean and variance of a normal distribution, are considered fixed but unknown.
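The Bayesian alternative treats the parameter itself as random. As a hedged illustration (our own example, using the conjugate Beta-Binomial model rather than anything specific from the text), updating a uniform prior on a success probability p after observing 7 successes in 10 trials:

```r
# Prior: p ~ Beta(1, 1), i.e. uniform. Data: k = 7 successes in n = 10 trials.
k <- 7; n <- 10
post_a <- 1 + k            # posterior shape parameters: Beta(8, 4)
post_b <- 1 + (n - k)
post_a / (post_a + post_b)              # posterior mean of p: 8/12
qbeta(c(0.025, 0.975), post_a, post_b)  # 95% credible interval for p
```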
We are often interested in how one or more predictor variables are associated with some outcome or response. We might postulate that the outcome depends on the predictors through some function.
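A linear model is the simplest such function. A minimal R sketch on simulated data (the coefficients 2 and 3 are our own choices):

```r
set.seed(11)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100, sd = 2)  # outcome depends linearly on the predictor
fit <- lm(y ~ x)
coef(fit)     # intercept and slope estimates, near 2 and 3
summary(fit)  # standard errors, t statistics, R-squared
```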
In statistics, we are often interested in some characteristic of a population. Maybe we are interested in the mean of some measurable characteristic, or maybe we are interested in the proportion of the population that has some property. In all but the simplest cases, the population is so large that it is impossible, or at least impractical, to take the measurement on every item in the population. We therefore have to settle for taking a sample and measuring the units selected for that sample.
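In R, a simple random sample without replacement is a single call to sample(). A sketch with a made-up population of 0/1 property indicators (population size and proportion are invented for illustration):

```r
set.seed(5)
population <- rbinom(100000, size = 1, prob = 0.3)  # 1 = item has the property
s <- sample(population, size = 500)                 # simple random sample, no replacement
mean(s)  # sample proportion, an estimate of the population proportion 0.3
```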
Forecasting is an important problem that spans many fields, including business and industry, government, economics, environmental sciences, medicine, social science, politics, and finance. Forecasting problems are often classified as short term, medium term, and long term. Short-term forecasting problems involve predicting events only a few time periods (days, weeks, and months) into the future. Medium-term forecasts extend from 1 to 2 years into the future, and long-term forecasting problems can extend beyond that by many years.
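As a hedged sketch of short-term forecasting in base R (the series is simulated and the AR(1) model is our own choice of illustration, not a general recommendation):

```r
set.seed(8)
y <- arima.sim(model = list(ar = 0.7), n = 200)  # simulated AR(1) series
fit <- arima(y, order = c(1, 0, 0))              # fit an AR(1) model
predict(fit, n.ahead = 5)$pred                   # forecasts a few periods ahead
```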
Often we look at the relationships between categorical variables, such as which hospital a patient is admitted to, or whether a person has diabetes, pre-diabetes, or no diabetes at all. These variables can be nominal (like the hospital) or ordinal (like diabetes, pre-diabetes, or no diabetes). In many cases we want to know something about how these variables are related.
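The standard tool for such questions is the chi-squared test of independence. A minimal R sketch on a made-up 2 x 3 table (all counts are invented for illustration):

```r
# Rows: hospital (A, B); columns: diabetes status
counts <- matrix(c(30, 45, 25,
                   40, 35, 25),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(hospital = c("A", "B"),
                                 status = c("none", "pre", "diabetes")))
chisq.test(counts)  # tests whether hospital and diabetes status are independent
```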
In the 1990s and before, most of the world’s information was stored on paper and other analog media, such as film. However, with the proliferation of personal computers and the internet, by 2000 one-quarter of the world’s information was stored digitally. Since that time, the amount of digital data has exploded, roughly doubling every couple of years, so that now more than 98% of all stored information is digital.