The greatest value of a picture is when it forces us to notice what we never expected to see.
John Tukey (1977), statistician and pioneer of exploratory data analysis
How should you summarise a dataset? This is what descriptive statistics and statistical graphics are for. A statistic is just a number computed from a data sample. Descriptive statistics provide a means for summarising the properties of a sample of data (many numbers or values) so that the most important results can be communicated effectively (using few numbers). Numerical and graphical methods, including descriptive statistics, are used in exploratory data analysis (EDA) to simplify the uninteresting and reveal the exceptional or unexpected in data.
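As a minimal sketch of the idea, the R code below reduces a small sample to a handful of descriptive statistics. The ten values are hypothetical, invented purely for illustration, and include one deliberately unusual point.

```r
# Hypothetical sample of 10 measurements (made-up values, for illustration only)
x <- c(4.1, 3.9, 4.4, 4.0, 4.2, 3.8, 4.3, 4.1, 6.5, 4.0)

n    <- length(x)   # sample size
xbar <- mean(x)     # sample mean (sensitive to the unusual value 6.5)
med  <- median(x)   # sample median (robust to the unusual value)
s    <- sd(x)       # sample standard deviation
iqr  <- IQR(x)      # interquartile range

# Collect the few numbers that summarise the many
summary_stats <- c(n = n, mean = xbar, median = med, sd = s, IQR = iqr)
print(summary_stats)
```

Comparing the mean with the median is one quick way to spot that something exceptional may be present in the sample.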
Plotting data
One of the basic principles of good data analysis is: always plot the data. The brain–eye system is incredibly good at recognising patterns, identifying outliers and seeing the structure in data. Visualisation is an important part of data analysis, and when confronted with a new dataset the first step in the analysis should be to plot the data. There is a wide array of different types of statistical plot useful in data analysis, and it is important to use a plot type appropriate to the data type. Graphics are usually produced for screen or paper and so are inherently two-dimensional, even if the data are not.
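A first look at a dataset might be sketched as follows. The example uses the morley data frame that ships with R (Michelson's speed-of-light measurements); whether it matches the version of the data used in the book is an assumption.

```r
# Built-in data frame: columns Expt (experiment 1-5), Run (1-20), Speed
# Speed is recorded in km/s with 299000 subtracted
data(morley)

# Histogram: the overall distribution of a single numerical variable
hist(morley$Speed,
     main = "Michelson speed-of-light data",
     xlab = "Speed - 299000 (km/s)")

# Box plots: compare the distribution across the five experiments
boxplot(Speed ~ Expt, data = morley,
        xlab = "Experiment",
        ylab = "Speed - 299000 (km/s)")
```

The histogram suits a single continuous variable, while the box plot suits a continuous variable split by a categorical one, which is the sense in which the plot type should match the data type.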
The variables can often be classified as explanatory or response.
Everything should be made as simple as possible, but not simpler.
Attributed to Einstein
We can use what we have learnt to start making some inferences about data. Maybe we have collected measurements of a quantity and wish to see if these are consistent with some theoretical expectation. We don't just want to compute the sample mean but to compare it with something else. Perhaps we have two samples, taken under different conditions (such as a ‘treatment’ and ‘control’ group) and wish to see if their mean responses differ. Another very common situation is that we have measurements of some response (y) taken at different values of some explanatory variable (x) and wish to quantify the way that y responds. We can go some way to getting useful inferences out of such data using numerical and graphical summaries (Chapter 2). These can be refined once we have studied some probability theory (Chapters 4 and 5).
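The last situation, a response y measured at different values of an explanatory variable x, can be sketched with numerical and graphical summaries alone. The example below uses R's built-in cars data frame (stopping distance against speed) as a stand-in; it is not a dataset from the text, and the straight-line model is an assumption made for illustration.

```r
# Built-in data frame: speed (explanatory, mph) and dist (response, ft)
data(cars)

# Graphical summary: scatter plot of response against explanatory variable
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# Numerical summary: least-squares straight line, dist = a + b * speed
fit   <- lm(dist ~ speed, data = cars)
slope <- unname(coef(fit)["speed"])   # change in response per unit of x

# Overlay the fitted line on the scatter plot
abline(fit)
print(coef(fit))
```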
Inference about the mean of a sample
We take repeated measurements of a single quantity, or measure the same quantity for each member of a finite sample, and wish to discover whether these data are consistent with a predetermined theoretical value. We want to know if our sample is consistent with being randomly drawn from a theoretical population, with some particular population mean. As an example, let's consider the first ‘experiment’ (batch of 20 runs) of Michelson's dataset (see Appendix B, section B.1).
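A rough version of this comparison can be sketched in R using the built-in morley data frame, assuming it corresponds to the data in Appendix B. The reference value used here, the modern defined speed of light in vacuum, is chosen for illustration and may not be the theoretical value the text has in mind.

```r
# First experiment (20 runs) from R's built-in Michelson data
data(morley)
expt1 <- subset(morley, Expt == 1)$Speed   # km/s, with 299000 subtracted

n    <- length(expt1)
xbar <- mean(expt1)             # sample mean
se   <- sd(expt1) / sqrt(n)     # standard error of the mean

# Assumed reference value: 299792.458 km/s, shifted to the same offset
ref <- 299792.458 - 299000

# Discrepancy between sample mean and reference, in units of standard error
discrepancy <- (xbar - ref) / se

cat("n =", n, " mean =", xbar, " standard error =", se, "\n")
cat("(mean - reference) / standard error =", discrepancy, "\n")
```

A discrepancy of many standard errors would suggest the sample is not consistent with being drawn from a population with that mean, although turning this into a formal statement needs the probability theory of later chapters.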
Dotted throughout the book are extracts of computer code that show how to perform the calculations under discussion. The examples are based on specific problems discussed in the text, but should be clear enough that they can also be used, with very little effort, for ‘real life’ data analysis problems. The computer codes are written in the R environment, which is introduced in this appendix.
What is R?
R is an environment for statistical computation and data analysis. You can think of it as a suite of software for manipulating data, producing plots and performing calculations, with a very wide range of powerful statistical tools. But it is also a programming language, so you can construct your own analyses with a little programming effort. It is one of the standard packages used by statisticians (professional and academic). To install R visit www.r-project.org/.
A first R session
First of all, start R, either by typing R at the command prompt (e.g. Linux) or double-clicking on the relevant icon (e.g. Windows).
A typical R session involves typing some commands into a ‘console’ window, and viewing the text and/or graphical output (which may appear in a pop-out window). The prompt is usually a ‘>’ sign, but can be changed if desired. At the prompt you can enter commands to execute. Virtually all commands in R have a command(arguments) format, where the name of the command is followed by some arguments enclosed in brackets (if there are no arguments the brackets are still present but empty).
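The command(arguments) pattern can be illustrated with a few standard R functions. The variable names are arbitrary, chosen only for this sketch; each line could be typed at the prompt in turn.

```r
# c() combines its arguments into a vector; <- assigns the result to a name
x <- c(3, 1, 2)

# One argument: sort the vector into increasing order
sorted_up <- sort(x)

# A second, named argument changes the behaviour of the same command
sorted_down <- sort(x, decreasing = TRUE)

# No arguments: the brackets are still present but empty
objects_here <- ls()   # lists the objects currently defined

print(sorted_up)
print(sorted_down)
print(objects_here)
```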
It is remarkable that a science which began with the consideration of games of chance should have become the most important object of human knowledge.
Pierre-Simon Laplace (1812) Théorie Analytique des Probabilités
Why should a scientist bother with statistics? Because science is about dealing rigorously with uncertainty, and the tools to accomplish this are statistical. Statistics and data analysis are an indispensable part of modern science.
In scientific work we look for relationships between phenomena, and try to uncover the underlying patterns or laws. But science is not just an ‘armchair’ activity where we can make progress by pure thought. Our ideas about the workings of the world must somehow be connected to what actually goes on in the world. Scientists perform experiments and make observations to look for new connections, test ideas, estimate quantities or identify qualities of phenomena. However, experimental data are never perfect. Statistical data analysis is the set of tools that helps scientists handle the limitations and uncertainties that always come with data. The purpose of statistical data analysis is insight, not just numbers. (That's why the book is called Scientific Inference and not something more like Statistics for Physics.)
Scientific method
Broadly speaking, science is the investigation of the physical world and its phenomena by experimentation. There are different schools of thought about the philosophy of science and the scientific method, but there are some elements that almost everyone agrees are components of the scientific method.