The main focus of this book is on statistical models for computer vision; the previous chapters concern models that relate visual measurements x to the world w. However, there has been little discussion of how the measurement vector x was created, and it has often been implied that it contains concatenated RGB pixel values. In state-of-the-art vision systems, the image pixel data are almost always preprocessed to form the measurement vector.
We define preprocessing to be any transformation of the pixel data prior to building the model that relates the data to the world. Such transformations are often ad hoc heuristics: their parameters are not learned from training data but are chosen based on experience of what works well. The philosophy behind image preprocessing is easy to understand: the image data may be contingent on many aspects of the real world that do not pertain to the task at hand. For example, in an object detection task, the RGB values will change depending on the camera gain, the illumination, the object pose, and the particular instance of the object. The goal of image preprocessing is to remove as much of this unwanted variation as possible while retaining the aspects of the image that are critical to the final decision.
In a sense, the need for preprocessing represents a failure; we are admitting that we cannot directly model the relationship between the RGB values and the world state. Inevitably, we must pay a price for this.
This chapter concerns regression problems: the goal is to estimate a univariate world state ω based on observed measurements x. The discussion is limited to discriminative methods in which the distribution Pr(ω|x) of the world state is directly modeled. This contrasts with Chapter 7 where the focus was on generative models in which the likelihood Pr(x|ω) of the observations is modeled.
To motivate the regression problem, consider body pose estimation: here the goal is to estimate the joint angles of a human body, based on an observed image of the person in an unknown pose (Figure 8.1). Such an analysis could form the first step toward activity recognition.
We assume that the image has already been preprocessed and a low-dimensional vector x that represents the shape of the contour has been extracted. Our goal is to use this data vector to predict a second vector containing the joint angles for each of the major body joints. In practice, we will estimate each joint angle separately; we can hence concentrate our discussion on how to estimate a univariate quantity ω from continuous observed data x. We begin by assuming that the relation between the world and the data is linear and that the uncertainty around this prediction is normally distributed with constant variance. This is the linear regression model.
Linear regression
The goal of linear regression is to predict the posterior distribution Pr(ω|x) over the world state ω based on observed data x.
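To make this concrete, here is a minimal sketch of maximum-likelihood fitting for such a model, assuming the formulation Pr(ω|x) = Norm_ω[φ0 + φᵀx, σ²] with offset φ0, gradient vector φ, and constant variance σ²; the code writes w for ω in ASCII, and all names below are illustrative rather than taken from the text.

```python
# A minimal sketch of maximum-likelihood linear regression, assuming the
# model Pr(w|x) = Norm[phi_0 + phi^T x, sigma^2] described above.
import numpy as np

def fit_linear_regression(X, w):
    """X: (I, D) array of I training measurement vectors; w: (I,) world states."""
    I = X.shape[0]
    X_aug = np.hstack([np.ones((I, 1)), X])   # leading 1 absorbs the offset phi_0
    # Under Gaussian noise with constant variance, maximum likelihood
    # coincides with ordinary least squares.
    phi, *_ = np.linalg.lstsq(X_aug, w, rcond=None)
    residuals = w - X_aug @ phi
    sigma_sq = residuals @ residuals / I      # ML estimate of the variance
    return phi, sigma_sq

def predict(x_new, phi, sigma_sq):
    """Mean and variance of the predictive distribution Pr(w|x) for a new x."""
    return phi[0] + phi[1:] @ x_new, sigma_sq
```

Because the noise is assumed Gaussian with constant variance, the predictive distribution for every new measurement has the same width; only its mean depends on x.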
There are already many computer vision textbooks, and it is reasonable to question the need for another. Let me explain why I chose to write this volume.
Computer vision is an engineering discipline; we are primarily motivated by the real-world concern of building machines that see. Consequently, we tend to categorize our knowledge by the real-world problem that it addresses. For example, most existing vision textbooks contain chapters on object recognition and stereo vision. The sessions at our research conferences are organized in the same way. The role of this book is to question this orthodoxy: Is this really the way that we should organize our knowledge?
Consider the topic of object recognition. A wide variety of methods have been applied to this problem (e.g., subspace models, boosting methods, bag of words models, and constellation models). However, these approaches have little in common. Any attempt to describe the grand sweep of our knowledge devolves into an unstructured list of techniques. How can we make sense of it all for a new student? I will argue for a different way to organize our knowledge, but first let me tell you how I see computer vision problems.
We observe an image and from this we extract measurements. For example, we might use the RGB values directly or we might filter the image or perform some more sophisticated preprocessing. The vision problem or goal is to use the measurements to infer the world state.
In Part V, we finally acknowledge the process by which real-world images are formed. Light is emitted from one or more sources and travels through the scene, interacting with the materials via physical processes such as reflection, refraction, and scattering. Some of this light enters the camera and is measured. We have a very good understanding of this forward model. Given known geometry, light sources, and material properties, computer graphics techniques can simulate what will be seen by the camera very accurately.
The ultimate goal for a vision algorithm would be a complete reconstruction, in which we aim to invert this forward model and estimate the light sources, materials, and geometry from the image. Here, we aim to capture a structural description of the world: we seek an understanding of where things are and to measure their optical properties, rather than a semantic understanding. Such a structural description can be exploited to navigate around the environment or build 3D models for computer graphics.
Unfortunately, full visual reconstruction is very challenging. For one thing, the solution is nonunique. For example, if the light source intensity increases, but the object reflectance decreases commensurately, the image will remain unchanged. Of course, we could make the problem unique by imposing prior knowledge, but even then reconstruction remains difficult; it is hard to effectively parameterize the scene, and the problem is highly non-convex.
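To make the nonuniqueness concrete, consider a crude sketch of the photometric part of the forward model (an illustrative simplification, not the full physics): the observed intensity at a pixel is roughly the product p = l · r of the incoming light intensity l and the surface reflectance r. For any α > 0, the pair (αl, r/α) produces exactly the same p, so the image alone cannot distinguish among these infinitely many explanations.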
In this part of the book, we consider a family of models that approximate both the 3D scene and the observed image with sparse sets of visual primitives (points).
In most of the models in this book, the observed data are treated as continuous. Hence, for generative models the data likelihood is usually based on the normal distribution. In this chapter, we explore generative models that treat the observed data as discrete. The data likelihoods are now based on the categorical distribution; they describe the probability of observing the different possible values of the discrete variable.
As a motivating example for the models in this chapter, consider the problem of scene classification (Figure 20.1). We are given example training images of different scene categories (e.g., office, coastline, forest, mountain) and we are asked to learn a model that can classify new examples. Studying the scenes in Figure 20.1 demonstrates how challenging a problem this is. Different images of the same scene may have very little in common with one another, yet we must somehow learn to identify them as the same. In this chapter, we will also discuss object recognition, which has many of the same characteristics; the appearance of an object such as a tree, bicycle, or chair can vary dramatically from one image to another, and we must somehow capture this variation.
The key to modeling these complex scenes is to encode the image as a collection of visual words, and use the frequencies with which these words occur as the substrate for further calculations. We start this chapter by describing this transformation.
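As a rough sketch of this encoding (assuming local descriptors have already been computed for each image; the dictionary construction here uses plain k-means, and all names are illustrative):

```python
# A minimal sketch of a bag-of-visual-words encoding: quantize local
# descriptors against a learned dictionary, then record word frequencies.
import numpy as np

def build_dictionary(descriptors, K, n_iter=20, seed=0):
    """Cluster training descriptors (N, D) into K prototype 'visual words'."""
    rng = np.random.default_rng(seed)
    words = descriptors[rng.choice(len(descriptors), K, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign each descriptor to its nearest word
        d = ((descriptors[:, None, :] - words[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # move each word to the mean of its assigned descriptors
        for k in range(K):
            if (labels == k).any():
                words[k] = descriptors[labels == k].mean(0)
    return words

def encode_image(descriptors, words):
    """Represent an image by the frequencies of its visual words."""
    d = ((descriptors[:, None, :] - words[None, :, :]) ** 2).sum(-1)
    labels = d.argmin(1)
    hist = np.bincount(labels, minlength=len(words)).astype(float)
    return hist / hist.sum()   # normalized word-frequency histogram
```

Note that the histogram discards the spatial layout of the words entirely; it is this kind of discrete frequency vector that categorical data likelihoods can then describe.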
In this chapter we discuss a family of models that explain observed data in terms of several underlying causes. These causes can be divided into three types: the identity of the object, the style in which it is observed, and the remaining variation.
To motivate these models, consider face recognition. For a facial image, the identity of the face (i.e., whose face it is) obviously influences the observed data. However, the style in which the face is viewed is also important. The pose, expression, and illumination are all style elements that might be modeled. Unfortunately, many other things also contribute to the final observed data: the person may have applied cosmetics, put on glasses, grown a beard, or dyed his or her hair. These myriad contributory elements are usually too difficult to model and are hence explained with a generic noise term.
In face recognition tasks, our goal is to infer whether the identities of face images are the same or different. For example, in face verification, we aim to infer a binary variable ω ∈ {0, 1}, where ω = 0 indicates that the identities differ and ω = 1 indicates that they are the same. This task is extremely challenging when there are large changes in pose, illumination, or expression; the change in the image due to style may dwarf the change due to identity (Figure 18.1).
The models in this chapter are generative, so the focus is on building separate density models over the observed image data for the cases where the faces do and don't have the same identity.
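One concrete way to realize this three-way decomposition (a sketch; the chapter develops its own parameterization and learning algorithms) is a linear subspace model of the form

x_ij = μ + F h_i + G s_ij + ε_ij,

where x_ij is the jth image of the ith person, μ is the mean face, the hidden variable h_i represents identity and is shared by every image of person i, s_ij represents the style of that particular image, and ε_ij is normally distributed noise that absorbs the remaining variation. Verification then amounts to comparing the likelihood of two images under the hypothesis that they share a single identity variable h against the hypothesis that they have different ones.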
I was very pleased to be asked to write this foreword, having seen snapshots of the development of this book since its inception. I write this having just returned from BMVC 2011, where I found that others had seen draft copies, and where I heard comments like “What amazing figures!”, “It's so comprehensive!”, and “He's so Bayesian!”.
But I don't want you to read this book just because it has amazing figures and provides new insights into vision algorithms of every kind, or even because it's “Bayesian” (although more on that later). I want you to read it because it makes clear the most important distinction in computer vision research: the difference between “model” and “algorithm.” This is akin to the distinction that Marr made with his three-level computational theory, but Prince's two-level distinction is made beautifully clear by his use of the language of probability.
Why is this distinction so important? Well, let us look at one of the oldest and apparently easiest problems in vision: separating an image into “figure” and “ground.” It is still common to hear students new to vision address this problem just as the early vision researchers did, by reciting an algorithm: first I'll use PCA to find the dominant color axis, then I'll generate a grayscale image, then I'll threshold that at some value, then I'll clean up the holes using morphological operators.
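For concreteness, the recited recipe might look something like the following sketch (an illustration of the foreword's point, not a recommended method; every threshold and structuring element is an arbitrary choice baked into the algorithm rather than a stated modeling assumption):

```python
# A sketch of the naive figure/ground recipe described above: PCA for the
# dominant color axis, grayscale projection, threshold, then morphology.
import numpy as np
from scipy import ndimage

def naive_figure_ground(rgb, threshold=0.5):
    """rgb: (H, W, 3) float image in [0, 1]; returns a binary 'figure' mask."""
    pixels = rgb.reshape(-1, 3)
    pixels = pixels - pixels.mean(0)
    # PCA: the dominant color axis is the right singular vector with
    # the largest singular value
    _, _, vt = np.linalg.svd(pixels, full_matrices=False)
    gray = (pixels @ vt[0]).reshape(rgb.shape[:2])   # project onto that axis
    gray = (gray - gray.min()) / (gray.max() - gray.min() + 1e-9)
    mask = gray > threshold                          # "threshold at some value"
    mask = ndimage.binary_closing(mask)              # "clean up the holes"
    mask = ndimage.binary_opening(mask)
    return mask
```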
This chapter concerns models for 2D and 3D shape. The motivation for shape models is twofold. First, we may wish to identify exactly which pixels in the scene belong to a given object. One approach to this segmentation problem is to model the outer contour of the object (i.e., the shape) explicitly. Second, the shape may provide information about the identity or other characteristics of the object: it can be used as an intermediate representation for inferring higher-level properties.
Unfortunately, modeling the shape of an object is challenging; we must account for deformations of the object, the possible absence of some parts of the object and even changes in the object topology. Furthermore, the object may be partially occluded, making it difficult to relate the shape model to the observed data.
One possible way to establish 2D object shape is to work bottom-up: a set of boundary fragments is identified using an edge detector (Section 13.2.1), and the goal is to connect these fragments to form a coherent object contour. Unfortunately, achieving this goal has proved surprisingly elusive. In practice, the edge detector finds extraneous edge fragments that are not part of the object contour and misses others that are part of the true contour. Hence it is difficult to connect the edge fragments in a way that correctly reconstructs the contour of an object.
In the final part of this book, we discuss four families of models. There is very little new theoretical material; these models are straight applications of the learning and inference techniques introduced in the first nine chapters. Nonetheless, this material addresses some of the most important machine vision applications: shape modeling, face recognition, tracking, and object recognition.
In Chapter 17 we discuss models that characterize the shape of objects. This is a useful goal in itself as knowledge of shape can help localize or segment an object. Furthermore, shape models can be used in combination with models for the RGB values to provide a more accurate generative account of the observed data.
In Chapter 18 we investigate models that distinguish between the identities of objects and the style in which they are observed; a prototypical example of this type of application would be face recognition. Here the goal is to build a generative model of the data that can separate critical information about identity from the irrelevant image changes due to pose, expression and lighting.
In Chapter 19 we discuss a family of models for tracking visual objects through time sequences. These are essentially graphical models based on chains, such as those discussed in Chapter 11. However, there are two main differences. First, we focus here on the case where the unknown variable is continuous rather than discrete. Second, we do not usually have the benefit of observing the full sequence; we must make a decision at each time step based only on information from the past.
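As a sketch of the simplest such model (a linear-Gaussian chain, i.e., the Kalman filter; the matrix names below are illustrative placeholders rather than the chapter's notation), each time step combines a prediction from the past with the new measurement:

```python
# One predict/update cycle of a Kalman filter: the continuous state is
# tracked online, using only the past posterior and the current measurement.
import numpy as np

def kalman_step(mu, Sigma, x, Psi, Sigma_p, Phi, Sigma_m):
    """
    mu, Sigma    : posterior mean/covariance of the state at time t-1
    x            : measurement at time t
    Psi, Sigma_p : temporal model      w_t = Psi w_{t-1} + noise(Sigma_p)
    Phi, Sigma_m : measurement model   x_t = Phi w_t     + noise(Sigma_m)
    """
    # predict from the past (the future is never observed)
    mu_pred = Psi @ mu
    Sigma_pred = Psi @ Sigma @ Psi.T + Sigma_p
    # incorporate the new measurement via the Kalman gain
    K = Sigma_pred @ Phi.T @ np.linalg.inv(Phi @ Sigma_pred @ Phi.T + Sigma_m)
    mu_new = mu_pred + K @ (x - Phi @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ Phi) @ Sigma_pred
    return mu_new, Sigma_new
```

Because each update uses only the posterior from the previous time step and the current measurement, the recursion respects the online constraint described above.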
In mathematics you don't understand things. You just get used to them.
John von Neumann (1903–57)
Mathematical explanation is a hot topic in current work in the philosophy of mathematics. We have already seen one reason for this: the close connection between the indispensability argument for mathematical realism and the scientific realist's reliance on inference to the best explanation. This connection is even tighter if it can be established that there are mathematical explanations of empirical phenomena. As a result, a great deal of recent work on realism-anti-realism issues in mathematics has focused on mathematical explanations in science. Irrespective of such issues, the question of mathematical explanation is important in its own right and deserves closer attention.
We start by making a distinction between two different senses of mathematical explanation. The first we call intra-mathematical explanations. These are mathematical explanations of mathematical facts. Such explanations can take the form of an explanatory proof – a proof that tells us why the theorem in question is true – or perhaps a recasting of the mathematical fact in question in terms of another area of mathematics. There is also the issue of whether mathematics can explain empirical facts. Call this extra-mathematical explanation. A full account of mathematical explanation will provide both a philosophically satisfying account of intra-mathematical explanation and an account that coheres with our account of explanation elsewhere in science.
Beauty is the first test: there is no permanent place in the world for ugly mathematics.
G. H. Hardy (1877–1947)
You know the old question about which 20 books, 20 albums, 20 movies, or whatever you'd like to have with you if you were stranded on a desert island? Well, in this chapter I'll give you my top 20 mathematical theorems for desert-island-bound philosophers. We look at a number of mathematical results that have some philosophical interest, or in some cases are just very cool pieces of mathematics. (Alternatively, you might think of this chapter as 20 theorems you should come to terms with before you die.) Of course, this is just my top 20 theorems. If you don't like my choices, feel free to construct your own list. For good measure I throw in a few famous open problems and interesting numbers to round out my desert-island survival kit.
Philosophers' favourites
The theorems in this section are well known by philosophers and rightly get a great deal of attention in philosophical circles. These are the obvious choices for desert island theorems, but in some cases you'd be disappointed to be stuck with just these. You wouldn't be disappointed because they are uninteresting or trivial; you'd be disappointed because they are just a bit too obvious. Everybody would have these! In any case, the theorems below are the classics – the obvious ones that almost anyone would put high on their list. (These are the Citizen Kanes and Vertigos of the maths world.)
Mathematics may be defined as the subject in which we never know what we are talking about, nor whether what we are saying is true.
Bertrand Russell (1872–1970)
In the last chapter we saw one of the main cases for Platonism, namely, the indispensability argument. In this chapter we look at a few anti-realist philosophies of mathematics. Each of these positions can be understood as a response to the indispensability argument. They are also motivated by the Benacerraf epistemic challenge to Platonism and the hope that it's easier to be rid of troublesome mathematical entities than it is to provide a Platonist epistemology.
Fictionalism
Fictionalism in the philosophy of mathematics is the view that mathematical statements, such as ‘7 + 5 = 12’ and ‘π is irrational’, are to be interpreted at face value and, thus interpreted, are false. Fictionalists are typically driven to reject the truth of such mathematical statements because these statements imply the existence of mathematical entities, and, according to fictionalists, there are no such entities. Fictionalism is a nominalist (or anti-realist) account of mathematics in that it denies the existence of a realm of abstract mathematical entities. It should be contrasted with mathematical realism (or Platonism), where mathematical statements are taken to be true and, moreover, are taken to be truths about mathematical entities. Fictionalism should also be contrasted with other nominalist philosophical accounts of mathematics that propose a reinterpretation of mathematical statements, according to which the statements in question are true but no longer about mathematical entities.