In Chapter 11, we discussed models that were structured as chains or trees. In this chapter, we consider models that associate a label with each pixel of an image. Since the unknown quantities lie on the pixel lattice, models with a grid structure are appropriate. In particular, we will consider graphical models in which each label has a direct probabilistic connection to each of its four neighbors. Critically, this means that there are loops in the underlying graphical model and so the dynamic programming and belief propagation approaches of the previous chapter are no longer applicable.
These grid models are predicated on the idea that the observed data at a pixel provide only very ambiguous information about the associated label. However, certain spatial configurations of labels are known to be more common than others, and we aim to exploit this knowledge to resolve the ambiguity. In this chapter, we describe the relative preference for different configurations of labels with a pairwise Markov random field, or MRF. As we shall see, maximum a posteriori inference for pairwise MRFs is tractable in some circumstances using a family of approaches known collectively as graph cuts.
To motivate the grid models, we introduce a representative application. In image denoising we observe a corrupted image in which the intensities at a certain proportion of pixels have been randomly changed to another value according to a uniform distribution (Figure 12.1).
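To make the role of the pairwise MRF concrete, the following Python sketch evaluates the energy (the negative log probability, up to a constant) of a candidate binary label field for the denoising problem. The unary and pairwise weights theta_u and theta_p and the simple Potts form of the costs are illustrative assumptions, not a definitive formulation.

    import numpy as np

    def mrf_energy(labels, noisy, theta_u=1.0, theta_p=2.0):
        """Energy of a candidate binary label field under a pairwise MRF.

        labels : (H, W) array of {0, 1} candidate labels
        noisy  : (H, W) array of {0, 1} observed (corrupted) pixel values
        """
        # Unary terms: penalise labels that disagree with the observed pixel.
        unary = theta_u * np.sum(labels != noisy)
        # Pairwise terms: a Potts cost for every unlike pair of 4-neighbours.
        pairwise = theta_p * (np.sum(labels[:, 1:] != labels[:, :-1]) +
                              np.sum(labels[1:, :] != labels[:-1, :]))
        return unary + pairwise

    # MAP inference seeks the label field that minimises this energy; for
    # binary models of this form the minimum can be found exactly with a
    # graph cut.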
In this chapter, we consider a pinhole camera viewing a plane in the world. In these circumstances, the camera equations simplify to reflect the fact that there is a one-to-one mapping between points on this plane and points in the image.
Mappings between the plane and the image can be described using a family of 2D geometric transformations. In this chapter, we characterize these transformations and show how to estimate their parameters from data. We revisit the three geometric problems from Chapter 14 for the special case of a planar scene.
To motivate the ideas of this chapter, consider an augmented reality application in which we wish to superimpose 3D content onto a planar marker (Figure 15.1). To do this, we must establish the rotation and translation of the plane relative to the camera. We will do this in two stages. First, we will estimate the 2D transformation between points on the marker and points in the image. Second, we will extract the rotation and translation from the transformation parameters.
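As a rough sketch of the first stage, the following Python fragment estimates a plane-to-image homography from four or more point correspondences using the direct linear transform. The function name, the homogeneous least-squares solve via the SVD, and the example coordinates are illustrative choices rather than a prescription from the text.

    import numpy as np

    def estimate_homography(src, dst):
        """Estimate the 3x3 homography mapping src points to dst points.

        src, dst : (N, 2) arrays of corresponding 2D positions, N >= 4.
        Uses the direct linear transform: two linear equations per
        correspondence, solved in a least-squares sense with the SVD.
        """
        rows = []
        for (x, y), (u, v) in zip(src, dst):
            rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
            rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
        A = np.asarray(rows, dtype=float)
        _, _, Vt = np.linalg.svd(A)
        H = Vt[-1].reshape(3, 3)    # singular vector of the smallest singular value
        return H / H[2, 2]          # remove the arbitrary overall scale

    # Hypothetical example: corners of a unit marker and their image positions.
    marker = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
    image = np.array([[10, 12], [110, 15], [115, 108], [8, 102]], dtype=float)
    H = estimate_homography(marker, image)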
Two-dimensional transformation models
In this section, we consider a family of 2D transformations, starting with the simplest and working toward the most general. We will motivate each by considering viewing a planar scene under different viewing conditions.
Euclidean transformation model
Consider a calibrated camera viewing a fronto-parallel plane at known distance, D (i.e., a plane whose normal corresponds to the w-axis of the camera).
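A minimal Python sketch of the Euclidean case, applying a rotation and translation to points on the plane (the angle and translation values here are arbitrary illustrative numbers):

    import numpy as np

    def euclidean_transform(points, angle, tx, ty):
        """Apply a 2D Euclidean transformation (rotation plus translation).

        points : (N, 2) array of positions on the plane
        angle  : rotation angle in radians
        tx, ty : translation components
        """
        R = np.array([[np.cos(angle), -np.sin(angle)],
                      [np.sin(angle),  np.cos(angle)]])
        return points @ R.T + np.array([tx, ty])   # x' = R x + t for each point

    pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    moved = euclidean_transform(pts, angle=np.pi / 6, tx=2.0, ty=-1.0)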
In the second part of this book (chapters 6–9), we treat vision as a machine learning problem and disregard everything we know about the creation of the image. For example, we will not exploit our understanding of perspective projection or light transport. Instead, we treat vision as pattern recognition; we interpret new image data based on prior experience of images in which the contents were known. We divide this process into two parts: in learning we model the relationship between the image data and the scene content. In inference, we exploit this relationship to predict the contents of new images.
To abandon useful knowledge about image creation may seem odd, but the logic is twofold. First, these same learning and inference techniques will also underpin our algorithms when image formation is taken into account. Second, it is possible to achieve a great deal with a pure learning approach to vision. For many tasks, knowledge of the image formation process is genuinely unnecessary.
The structure of Part II is as follows. In Chapter 6 we present a taxonomy of models that relate the measured image data and the actual scene content. In particular, we distinguish between generative models and discriminative models. For generative models, we build a probability model of the data and parameterize it by the scene content. For discriminative models, we build a probability model of the scene content and parameterize it by the data.
This chapter introduces the pinhole or projective camera. This is a purely geometric model that describes the process whereby points in the world are projected into the image. Clearly, the position in the image depends on the position in the world, and the pinhole camera model captures this relationship.
To motivate this model, we will consider the problem of sparse stereo reconstruction (Figure 14.1). We are given two images of a rigid object taken from different positions. Let us assume that we can identify corresponding 2D features between the two images – points that are projected versions of the same position in the 3D world. The goal now is to establish this 3D position using the observed 2D feature points. The resulting 3D information could be used by a robot to help it navigate through the scene, or to facilitate object recognition.
The pinhole camera
In real life, a pinhole camera consists of a chamber with a small hole (the pinhole) in the front (Figure 14.2). Rays from an object in the world pass through this hole to form an inverted image on the back face of the box, or image plane. Our goal is to build a mathematical model of this process.
It is slightly inconvenient that the image from the pinhole camera is upside-down. Hence, we instead consider the virtual image that would result from placing the image plane in front of the pinhole.
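As an illustrative sketch (the parameter names and values are assumptions, not the text's notation), the following Python fragment projects 3D points expressed in camera coordinates onto the virtual image plane using focal length and principal point parameters:

    import numpy as np

    def pinhole_project(points_3d, fx, fy, cx, cy):
        """Project 3D points in camera coordinates onto the virtual image plane.

        points_3d : (N, 3) array of points in front of the camera (positive depth)
        fx, fy    : focal length parameters (in pixels)
        cx, cy    : principal point (where the optical axis meets the image)
        """
        u = fx * points_3d[:, 0] / points_3d[:, 2] + cx
        v = fy * points_3d[:, 1] / points_3d[:, 2] + cy
        return np.stack([u, v], axis=1)

    # A point and a copy of it twice as far away (scaled by 2 in every
    # coordinate) project to the same image position -- the depth/size
    # ambiguity that stereo reconstruction resolves.
    pts = np.array([[0.5, 0.2, 2.0], [1.0, 0.4, 4.0]])
    print(pinhole_project(pts, fx=800.0, fy=800.0, cx=320.0, cy=240.0))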
This is a brief guide to the notational conventions used in this text.
Scalars, vectors, and matrices
We denote scalars by either small or capital letters a, A, α. We denote column vectors by bold small letters a, ϕ. When we need a row vector we usually present this as the transpose of a column vector aT, ϕT.
We represent matrices by bold capital letters B, Φ. The element in the ith row and jth column of matrix A is written as aij. The jth column of matrix A is written as aj. When we need to refer to the ith row of a matrix, we write this as ai•, where the bullet • indicates that we are considering all possible values of the column index.
We concatenate two D × 1 column vectors horizontally as A = [b, c] to form the D × 2 matrix A. We concatenate two D × 1 column vectors vertically as a = [bT, cT]T to form the 2D × 1 vector a. Although this notation is cumbersome, it allows us to represent vertical concatenations within a single line of text.
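The same conventions map directly onto array code; a minimal numpy illustration, with variable names that simply mirror the notation above:

    import numpy as np

    b = np.array([[1.0], [2.0], [3.0]])   # D x 1 column vector
    c = np.array([[4.0], [5.0], [6.0]])   # D x 1 column vector

    A = np.hstack([b, c])   # horizontal concatenation A = [b, c]: D x 2 matrix
    a = np.vstack([b, c])   # vertical concatenation a = [bT, cT]T: 2D x 1 vector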
Variables and parameters
We denote variables with Roman letters a, b. The most common examples are the observed data, which are always denoted by x, and the state of the world, which is always denoted by w. However, other hidden or latent variables are also represented by Roman letters. We denote parameters of the model by Greek letters μ, Φ, σ2. These are distinguished from variables in that there is usually a single set of parameters that explains the relation between many sets of variables.
This chapter concerns discriminative models for classification. The goal is to directly model the posterior probability distribution Pr(w|x) over a discrete world state w ∈ {1,…,K} given the continuous observed data vector x. Models for classification are very closely related to those for regression and the reader should be familiar with the contents of Chapter 8 before proceeding.
To motivate the models in this chapter, we will consider gender classification: here we observe a 60 × 60 RGB image containing a face (Figure 9.1) and concatenate the RGB values to form the 10800 × 1 vector x. Our goal is to take the vector x and return the probability distribution Pr(w|x) over a label w ∈ {0,1} indicating whether the face is male (w = 0) or female (w = 1).
Gender classification is a binary classification task as there are only two possible values of the world state. Throughout most of this chapter, we will restrict our discussion to binary classification. We discuss how to extend these models to cope with an arbitrary number of classes in Section 9.9.
Logistic regression
We will start by considering logistic regression, which, despite its name, is a model for classification rather than regression. Logistic regression (Figure 9.2) is a discriminative model; we select a probability distribution over the world state w ∈ {0,1} and make its parameters contingent on the observed data x.
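A minimal Python sketch of the inference step of this model, assuming a scalar offset phi0 and a weight vector phi that have already been learned (the variable names are illustrative):

    import numpy as np

    def logistic_posterior(x, phi0, phi):
        """Return Pr(w = 1 | x) under a logistic regression model.

        x    : (D,) observed data vector
        phi0 : scalar offset
        phi  : (D,) weight vector
        """
        activation = phi0 + phi @ x                 # linear function of the data
        return 1.0 / (1.0 + np.exp(-activation))    # logistic sigmoid maps to (0, 1)

    # Pr(w = 0 | x) is simply 1 - Pr(w = 1 | x).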
At an abstract level, the goal of computer vision problems is to use the observed image data to infer something about the world. For example, we might observe adjacent frames of a video sequence and infer the camera motion, or we might observe a facial image and infer the identity.
The aim of this chapter is to describe a mathematical framework for solving this type of problem and to organize the resulting models into useful subgroups, which will be explored in subsequent chapters.
Computer vision problems
In vision problems, we take visual data x and use them to infer the state of the world w. The world state w may be continuous (the 3D pose of a body model) or discrete (the presence or absence of a particular object). When the state is continuous, we call this inference process regression. When the state is discrete, we call it classification.
Unfortunately, the measurements x may be compatible with more than one world state w. The measurement process is noisy, and there is inherent ambiguity in visual data: a lump of coal viewed under bright light may produce the same luminance measurements as white paper in dim light. Similarly, a small object seen close-up may produce the same image as a larger object that is further away.
In the face of such ambiguity, the best that we can do is to return the posterior probability distribution Pr(w|x) over possible states w.
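A toy numeric sketch of this idea for a two-state world, applying Bayes' rule with numbers invented purely for illustration:

    import numpy as np

    # Two world states: 0 = coal in bright light, 1 = white paper in dim light.
    prior = np.array([0.5, 0.5])          # Pr(w): assumed for illustration
    likelihood = np.array([0.60, 0.55])   # Pr(x | w) for the observed luminance

    posterior = likelihood * prior / np.sum(likelihood * prior)   # Bayes' rule
    print(posterior)   # roughly [0.52, 0.48]: the data barely disambiguate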
The first part of this book (Chapters 2–5) is devoted to a brief review of probability and probability distributions. Almost all models for computer vision can be interpreted in a probabilistic context, and in this book we will present all the material in this light. The probabilistic interpretation may initially seem confusing, but it has a great advantage: it provides a common notation that will be used throughout the book and will elucidate relationships between different models that would otherwise remain opaque.
So why is probability a suitable language to describe computer vision problems? In a camera, the three-dimensional world is projected onto the optical surface to form the image: a two-dimensional set of measurements. Our goal is to take these measurements and use them to establish the properties of the world that created them. However, there are two problems. First, the measurement process is noisy; what we observe is not the amount of light that fell on the sensor, but a noisy estimate of this quantity. We must describe the noise in these data, and for this we use probability. Second, the relationship between world and measurements is generally many to one: there may be many real-world configurations that are compatible with the same measurements. The chance that each of these possible worlds is present can also be described using probability.
The structure of Part I is as follows: in Chapter 2, we introduce the basic rules for manipulating probability distributions including the ideas of conditional and marginal probability and Bayes' rule.
The previous chapters discussed models that relate the observed measurements to some aspect of the world that we wish to estimate. In each case, this relationship depended on a set of parameters and for each model we presented a learning algorithm that estimated these parameters.
Unfortunately, the utility of these models is limited because every element of the model depends on every other. For example, in generative models we model the joint probability of the observations and the world state. In many problems both of these quantities may be high-dimensional. Consequently, the number of parameters required to characterize their joint density accurately is very large. Discriminative models suffer from the same pathology: if every element of the world state depends on every element of the data, a large number of parameters will be required to characterize this relationship. In practice, the required amount of training data and the computational burden of learning and inference reach impractical levels.
The solution to this problem is to reduce the dependencies between variables in the model by identifying (or asserting) some degree of redundancy. To this end, we introduce the idea of conditional independence, which is a way of characterizing these redundancies. We then introduce graphical models which are graph-based representations of the conditional independence relations. We discuss two different types of graphical models – directed and undirected – and we consider the implications for learning, inference, and drawing samples.
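As a toy illustration of the saving that conditional independence brings, the following Python sketch builds the joint distribution of three binary variables from a chain factorization Pr(x1, x2, x3) = Pr(x1)Pr(x2|x1)Pr(x3|x2); the probability tables are arbitrary invented values.

    import numpy as np

    # Chain Pr(x1, x2, x3) = Pr(x1) Pr(x2|x1) Pr(x3|x2): x3 is conditionally
    # independent of x1 given x2, so only small tables are needed.
    p_x1 = np.array([0.6, 0.4])                    # Pr(x1)
    p_x2_given_x1 = np.array([[0.7, 0.3],          # Pr(x2 | x1), rows index x1
                              [0.2, 0.8]])
    p_x3_given_x2 = np.array([[0.9, 0.1],          # Pr(x3 | x2), rows index x2
                              [0.5, 0.5]])

    # The full 2 x 2 x 2 joint, rebuilt from the factorization.
    joint = (p_x1[:, None, None] *
             p_x2_given_x1[:, :, None] *
             p_x3_given_x2[None, :, :])
    assert np.isclose(joint.sum(), 1.0)

    # For a chain of N binary variables, the factorized model needs O(N)
    # table entries where an unstructured joint would need 2**N.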
The models in chapters 6–9 describe the relationship between a set of measurements and the world state. They work well when the measurements and the world state are both low dimensional. However, there are many situations where this is not the case, and these models are unsuitable.
For example, consider the semantic image labeling problem in which we wish to assign a label that denotes the object class to each pixel in the image. In a road scene, we might wish to label pixels as ‘road’, ‘sky’, ‘car’, ‘tree’, ‘building’ or ‘other’. For an image with N = 10000 pixels, this means we need to build a model relating the 10000 measured RGB triples to 6^10000 possible world states. None of the models discussed so far can cope with this challenge: the number of parameters involved (and hence the amount of training data and the computational requirements of the learning and inference algorithms) is far beyond what current machines can handle.
One possible solution to this problem would be to build a set of independent local models: for example, we could build models that relate each pixel label separately to the nearby RGB data. However, this is not ideal as the image may be locally ambiguous. For example, a small blue image patch might result from a variety of semantically different classes: sky, water, a car door or a person's clothing.