In this chapter we show that the consistency problem can be hard for some very simple feed-forward neural networks. In Section 25.2, we show that, for certain graded spaces of feed-forward linear threshold networks with binary inputs, the consistency problem is NP-hard. This shows that for each such family of networks, unless RP = NP, there can be no efficient learning algorithm in the restricted learning model and hence, in particular, no efficient learning algorithm in the standard model of Part 1. These networks are somewhat unusual in that the output unit is constrained to compute a conjunction. In Section 25.3, we extend the hardness result to networks with an arbitrary linear threshold output unit, but with real inputs. In Section 25.4, we describe similar results for graded classes of feed-forward sigmoid networks with linear output units, showing that approximately minimizing sample error is NP-hard for these classes. Unless RP = NP, this shows that there can be no efficient learning algorithm in the restricted learning model of Part 3.
Linear Threshold Networks with Binary Inputs
For each positive integer n, we define a neural network on n inputs as follows. The network has n binary inputs and k + 1 linear threshold units (k ≥ 1). It has two layers of computation units, the first consisting of k linear threshold units, each connected to all of the inputs.
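A network of this form can be sketched in a few lines of Python. The weights, thresholds, and inputs below are illustrative choices, not part of the definition; the definition fixes only the architecture (k first-layer linear threshold units, each seeing all n binary inputs, feeding a conjunction output unit).

```python
import numpy as np

def threshold(z):
    """Linear threshold activation: 1 if z >= 0, else 0 (componentwise)."""
    return (z >= 0).astype(int)

def two_layer_threshold_net(x, W, t):
    """First layer: k linear threshold units, each connected to all n inputs.
    Output unit: a conjunction (AND) of the k first-layer outputs."""
    hidden = threshold(W @ x - t)   # outputs of the k first-layer units
    return int(hidden.all())        # conjunction output unit

# Illustrative instance with n = 3 binary inputs and k = 2 first-layer units
W = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
t = np.array([1.0, 2.0])
```

Here `two_layer_threshold_net(np.array([1, 1, 1]), W, t)` outputs 1, since both first-layer units fire; setting any two inputs to 0 silences a first-layer unit and the conjunction outputs 0.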
The general upper and lower bounds on sample complexity described in Chapters 4 and 5 show that the VC-dimension determines the sample complexity of the learning problem for a function class H. The results of Chapters 6 and 8 show that, for a variety of neural networks, the VC-dimension grows with the number of parameters. In particular, the lower bounds on the VC-dimension of neural networks described in Section 6.3, together with Theorem 6.5, show that with mild conditions on the architecture of a multi-layer network and the activation functions of its computation units, the VC-dimension grows at least linearly with the number of parameters.
These results do not, however, provide a complete explanation of the sample size requirements of neural networks for pattern classification problems. In many applications of neural networks the network parameters are adjusted on the basis of a small training set, sometimes an order of magnitude smaller than the number of parameters. In this case, we might expect the network to ‘overfit’, that is, to accurately match the training data, but predict poorly on subsequent data. Indeed, the results from Part 1 based on the VC-dimension suggest that the estimation error could be large, because VCdim(H)/m is large. Nonetheless, in many such situations these networks seem to avoid overfitting, in that the training set error is a reliable estimate of the error on subsequent examples. Furthermore, Theorem 7.1 shows that an arbitrarily small modification to the activation function can make the VC-dimension infinite, and it seems unnatural that such a change should affect the statistical behaviour of networks in applications.
The previous chapter gave a formal definition of the learning problem, and showed that it can be solved if the class HN of functions is finite. However, many interesting function classes are not finite. For example, the number of functions computed by the perceptron with real-valued weights and inputs is infinite. Many other neural networks can also be represented as a parameterized function class with an infinite parameter set. We shall see that learning is possible for many (but not all) function classes like this, provided the function class is not too complex. In this chapter, we examine two measures of the complexity of a function class, the growth function and the VC-dimension, and we show that these are intimately related. In the next two chapters, we shall see that the growth function and VC-dimension of a function class determine the inherent sample complexity of the learning problem.
The Growth Function
Consider a finite subset S of the input space X. For a function class H, the restriction of H to the set S (that is, the set of restrictions to S of all functions in H) is denoted by H|S. If H|S is the set of all functions from S to {0, 1}, then clearly, H is as powerful as it can be in classifying the points in S. We can view the cardinality of H|S (and in particular how it compares with 2^|S|) as a measure of the classification complexity of H with respect to the set S.
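For a finite class and a finite sample, H|S can be computed directly by collecting the distinct value patterns that functions in H realize on S. The class below (threshold functions on the reals, with a few illustrative threshold values) is our own example, not one from the text.

```python
def restriction(H, S):
    """H|S: the set of distinct restrictions to S of functions in H,
    each represented as the tuple of its values on the points of S."""
    return {tuple(h(x) for x in S) for h in H}

# Illustrative class: threshold functions x -> 1 iff x >= a
H = [lambda x, a=a: int(x >= a) for a in (0.5, 1.5, 2.5, 3.5)]
S = [1, 2, 3]
```

On this S, `restriction(H, S)` contains only the 4 monotone patterns (1,1,1), (0,1,1), (0,0,1), (0,0,0), far fewer than the 2^|S| = 8 possible dichotomies, so this class is weak at classifying the points of S.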
Chapters 4 and 5 show that the Vapnik-Chervonenkis dimension is crucial in characterizing learnability by binary-output networks, and that it can be used to bound the growth function. Chapter 10 shows that covering numbers are a generalization of the growth function useful for analysing classification by real-output neural networks (or, more generally, by real-valued function classes). We see later in the book that covering numbers are also important in analysing other models of learning. It is natural to ask whether there is a ‘combinatorial’ measure analogous to the VC-dimension that can be used to bound the covering numbers of a class of real-valued functions, and hence to quantify the sample complexity of classification learning. This is largely true, although the definitions and proofs are more complicated than for the binary classification case. In this chapter we introduce the key ‘dimensions’ that we use in our analysis of learning with real function classes and establish some associated basic results and useful properties. In the next chapter we show how these dimensions may be used to bound the covering numbers.
The Pseudo-Dimension
The definition of the pseudo-dimension
To introduce the first of the new dimensions, we first present a slightly different formulation of the definition of the VC-dimension. For a set of functions H mapping from X to {0, 1}, recall that a subset S = {x1, x2, …, xm} of X is shattered by H if H|S has cardinality 2^m.
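For small finite examples, shattering and the VC-dimension can be checked by brute force over subsets. The interval class used below is an illustrative choice of ours (intervals on the reals have VC-dimension 2); the search is exponential and is meant only to make the definition concrete.

```python
from itertools import combinations

def is_shattered(H, S):
    """S is shattered by H iff H|S realizes all 2^|S| dichotomies of S."""
    patterns = {tuple(h(x) for x in S) for h in H}
    return len(patterns) == 2 ** len(S)

def vc_dimension(H, X):
    """Largest size of a subset of X shattered by H (brute-force search)."""
    d = 0
    for m in range(1, len(X) + 1):
        if any(is_shattered(H, S) for S in combinations(X, m)):
            d = m
    return d

# Illustrative class: indicator functions of intervals [a, b]
H = [lambda x, a=a, b=b: int(a <= x <= b)
     for a in range(4) for b in range(4)]
```

Any pair of points can be shattered by intervals, but no triple can (the pattern 1, 0, 1 on three ordered points is unrealizable), so the brute-force search reports dimension 2 on a three-point domain.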
In this chapter, and many subsequent ones, we deal with feed-forward neural networks. Initially, we shall be particularly concerned with feed-forward linear threshold networks, which can be thought of as combinations of perceptrons.
To define a neural network class, we need to specify the architecture of the network and the parameterized functions computed by its components. In general, a feed-forward neural network has as its main components a set of computation units, a set of input units, and a set of connections from input or computation units to computation units. These connections are directed; that is, each connection is from a particular unit to a particular computation unit. The key structural property of a feed-forward network—the feed-forward condition—is that these connections do not form any loops. This means that the units can be labelled with integers in such a way that if there is a connection from the unit labelled i to the computation unit labelled j then i < j.
Associated with each unit is a real number called its output. The output of a computation unit is a particular function of the outputs of units that are connected to it. The feed-forward condition guarantees that the outputs of all units in the network can be written as an explicit function of the network inputs.
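The labelling guaranteed by the feed-forward condition makes evaluation straightforward: processing computation units in increasing label order ensures every predecessor's output is available when needed. The representation below (a dictionary from unit labels to predecessor lists and componentwise functions) is our own sketch, not notation from the text.

```python
def evaluate(inputs, units):
    """Evaluate a feed-forward network.
    inputs: dict mapping input-unit labels to their values.
    units:  dict mapping each computation-unit label j to a pair
            (predecessor labels, function of their outputs).
    The feed-forward condition (i < j for every connection i -> j)
    lets us process computation units in increasing label order."""
    out = dict(inputs)
    for j in sorted(units):
        preds, f = units[j]
        out[j] = f([out[i] for i in preds])
    return out

# Illustrative network: units 1, 2 are inputs; unit 3 sums them;
# unit 4 is a linear threshold unit applied to unit 3's output
units = {
    3: ([1, 2], lambda v: v[0] + v[1]),
    4: ([3],    lambda v: int(v[0] >= 1)),
}
```

Since the connection graph has no loops, `evaluate` visits each unit exactly once, and every unit's output is indeed an explicit function of the network inputs.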
This book is about the use of artificial neural networks for supervised learning problems. Many such problems occur in practical applications of artificial neural networks. For example, a neural network might be used as a component of a face recognition system for a security application. After seeing a number of images of legitimate users' faces, the network needs to determine accurately whether a new image corresponds to the face of a legitimate user or an imposter. In other applications, such as the prediction of future price of shares on the stock exchange, we may require a neural network to model the relationship between a pattern and a real-valued quantity.
In general, in a supervised learning problem, the learning system must predict the labels of patterns, where the label might be a class label or a real number. During training, it receives some partial information about the true relationship between patterns and their labels in the form of a number of correctly labelled patterns. For example, in the face recognition application, the learning system receives a number of images, each labelled as either a legitimate user or an imposter. Learning to accurately label patterns from training data in this way has two major advantages over designing a hard-wired system to solve the same problem: it can save an enormous amount of design effort, and it can be used for problems that cannot easily be specified precisely in advance, perhaps because the environment is changing.
In this chapter, we consider learning algorithms for classes F of real-valued functions that can be expressed as convex combinations of functions from some class G of basis functions. The key example of such a class is that of feed-forward networks with a linear output unit in which the sum of the magnitudes of the output weights is bounded by some constant B. In this case, the basis function class G is the set of functions that can be computed by any non-output unit in the network, and their negations, scaled by B. We investigate two algorithms. Section 26.2 describes Construct, an algorithm for the real prediction problem, and Section 26.3 describes AdaBoost, an algorithm for the restricted version of the real classification problem. Both algorithms use a learning algorithm for the basis function class to iteratively add basis functions to a convex combination, leaving previous basis functions fixed.
Real Estimation with Convex Combinations of Basis Functions
Theorem 14.10 (Section 14.4) shows that any convex combination of bounded basis functions can be accurately approximated (with respect to the distance dL2(P), for instance) using a small convex combination. This shows that the approximate-SEM problem for the class co(G) can be solved by considering only small convex combinations of functions from G. In fact, the problem can be simplified even further. The following theorem shows that we can construct a small convex combination in an iterative way, by greedily minimizing error as each basis function is added.
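The greedy idea can be sketched as follows. This is only an illustration of building a convex combination iteratively by minimizing sample error at each step, not the Construct algorithm itself: here the basis functions are given as value vectors on the sample, and the mixing coefficient is found by a coarse grid search rather than the step-size schedule used in the analysis; the iterates start from the zero function for simplicity.

```python
import numpy as np

def greedy_convex_fit(G, y, T):
    """Greedily build f_t = (1 - a) f_{t-1} + a g, choosing at each step
    the basis function g (a row of G, its values on the m sample points)
    and coefficient a in [0, 1] that minimize squared error against y."""
    f = np.zeros(len(y))
    alphas = np.linspace(0.0, 1.0, 101)   # coarse grid search over a
    for _ in range(T):
        best_err, best_f = None, None
        for g in G:
            for a in alphas:
                cand = (1 - a) * f + a * g
                err = np.mean((cand - y) ** 2)
                if best_err is None or err < best_err:
                    best_err, best_f = err, cand
        f = best_f
    return f
```

Note that each step re-weights the whole of the previous combination by (1 − a) rather than re-optimizing earlier coefficients, exactly the sense in which previous basis functions are left fixed.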
In the previous chapters we showed that a class of functions of finite VC-dimension is learnable by the fairly natural class of SEM algorithms, and we provided bounds on the estimation error and sample complexity of these learning algorithms in terms of the VC-dimension of the class. In this chapter we provide lower bounds on the estimation error and sample complexity of any learning algorithm. These lower bounds are also in terms of the VC-dimension, and are not vastly different from the upper bounds of the previous chapter. We shall see, as a consequence, that the VC-dimension not only characterizes learnability, in the sense that a function class is learnable if and only if it has finite VC-dimension, but it provides precise information about the number of examples required.
A Lower Bound for Learning
A technical lemma
The first step towards a general lower bound on the sample complexity is the following technical lemma, which will also prove useful in later chapters. It concerns the problem of estimating the parameter describing a Bernoulli random variable.
Lemma 5.1 Suppose that α is a random variable uniformly distributed on {α−, α+}, where α− = 1/2 − ε/2 and α+ = 1/2 + ε/2, with 0 < ε < 1. Suppose that ξ1, …, ξm are i.i.d. (independent and identically distributed) {0, 1}-valued random variables with Pr(ξi = 1) = α for all i. Let f be a function from {0, 1}^m to {α−, α+}.
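The setting of the lemma is easy to simulate. The sketch below (our own illustration, not part of the text) uses the natural rule f that guesses α+ exactly when the empirical frequency of 1s exceeds 1/2, and estimates how often the guess is wrong; for ε small relative to 1/√m, the error rate stays bounded away from zero, which is the phenomenon the lemma quantifies.

```python
import random

def guess_error_rate(eps, m, trials=2000, seed=0):
    """Simulate the lemma's setting: alpha is uniform on
    {1/2 - eps/2, 1/2 + eps/2}; after seeing m Bernoulli(alpha) draws,
    guess alpha+ iff the sample mean exceeds 1/2.
    Returns the fraction of trials on which the guess was wrong."""
    rng = random.Random(seed)
    lo, hi = 0.5 - eps / 2, 0.5 + eps / 2
    errors = 0
    for _ in range(trials):
        alpha = rng.choice([lo, hi])
        mean = sum(rng.random() < alpha for _ in range(m)) / m
        guess = hi if mean > 0.5 else lo
        errors += (guess != alpha)
    return errors / trials
```

With ε = 0.9 and m = 200 the two parameter values are easy to distinguish and the error rate is near zero, whereas with ε = 0.05 and m = 10 the rule errs on a substantial fraction of trials.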
In this part of the book, we turn our attention to aspects of the time complexity, or computational complexity of learning. Until now we have discussed only the sample complexity of learning, and we have been using the phrase ‘learning algorithm’ without any reference to algorithmics. But issues of running time are crucial. If a learning algorithm is to be of practical value, it must, first, be possible to implement the learning algorithm on a computer; that is, it must be computable and therefore, in a real sense, an algorithm, not merely a function. Furthermore, it should be possible to produce a good output hypothesis ‘quickly’.
One subtlety that we have not so far explicitly dealt with is that a practical learning algorithm does not really output a hypothesis; rather, it outputs a representation of a hypothesis. In the context of neural networks, such a representation consists of a state of the network; that is, an assignment of weights and thresholds. In studying the computational complexity of a learning algorithm, one therefore might take into account the ‘complexity’ of the representation output by the learning algorithm. However, this will not be necessary in the approach taken here. For convenience, we shall continue to use notation suggesting that the output of a learning algorithm is a function from a class of hypotheses, but the reader should be aware that, formally, the output is a representation of such a function.