For a fixed set of input examples, one can decompose the N-sphere into cells each consisting of all the perceptron coupling vectors J giving rise to the same classification of those examples. Several aspects of perceptron learning discussed in the preceding chapters are related to the geometric properties of this decomposition, which turns out to have random multifractal properties. Our outline of the mathematical techniques related to the multifractal method will of course be short and ad rem; see [172, 173] for a more detailed introduction. But this alternative description provides a deeper and unified view of the different learning properties of the perceptron. It highlights some of the more subtle aspects of the thermodynamic limit and its role in the statistical mechanics analysis of perceptron learning. In this way we finish our discussion of the perceptron with an encompassing multifractal description, preparing the way for the application of this approach to the analysis of multilayer networks.
The shattered coupling space
Consider a set of p = αN examples ξµ generated independently at random from the uniform distribution on the N-sphere. Each hyperplane perpendicular to one of these inputs cuts the coupling space of a spherical perceptron, which is the very same N-sphere, into two half-spheres according to the two possible classifications of the example.
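This decomposition can be made concrete numerically. The following sketch (sizes N, p and the sampling count are arbitrary illustrative choices) samples coupling vectors J at random and records which classification of the examples each one induces; coupling vectors producing the same label tuple lie in the same cell:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 10, 4                       # sphere dimension and number of examples

xi = rng.standard_normal((p, N))   # random example directions
xi /= np.linalg.norm(xi, axis=1, keepdims=True)

# Each coupling vector J induces a classification tuple
# (sign(J.xi^1), ..., sign(J.xi^p)); vectors with the same tuple
# lie in the same cell of the decomposition of the sphere.
cells = set()
for _ in range(100000):
    J = rng.standard_normal(N)
    cells.add(tuple(np.sign(xi @ J).astype(int)))

print(len(cells))                  # at most 2**p distinct cells
```

Since p ≤ N here and the examples are in general position, all 2^p classifications are realizable, so the number of distinct cells found approaches 2^p as more coupling vectors are sampled.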
In this book we have discussed how various aspects of learning in artificial neural networks may be quantified by using concepts and techniques developed in the statistical mechanics of disordered systems. These methods grew out of the desire to understand some strange low-temperature properties of disordered magnets; nevertheless their usefulness for and efficiency in the analysis of a completely different class of complex systems underlines the generality and strength of the principles of statistical mechanics.
In this final chapter we have collected some additional examples of non-physical complex systems for which an analysis using methods of statistical mechanics similar to those employed for the study of neural networks has given rise to new and interesting results. Compared with the previous chapters, the discussions in the present one will be somewhat more superficial – merely pointing to the qualitative analogies with the problems elucidated previously, rather than working out the consequences in full detail. Moreover, some of the problems we consider are strongly linked to information processing and artificial neural networks, whereas others are not. In all cases quenched random variables are used to represent complicated interactions which are not known in detail, and the typical behaviour in a properly defined thermodynamic limit is of particular interest.
Support vector machines
The main reason which prevents the perceptron from being a serious candidate for the solution of many real-world learning problems is that it can only implement linearly separable Boolean functions.
The Gibbs rule discussed in the previous chapter characterizes the typical generalization behaviour of the students forming the version space. It is hence well suited for a general theoretical analysis. For a concrete practical problem it is, however, hardly the best choice and there is a variety of other learning rules which are often more direct and may also show a better performance. The purpose of this chapter is to introduce a representative selection of these learning rules, to discuss some of their features, and to compare their properties with those of the Gibbs rule.
The Hebb rule
The oldest and maybe most important learning rule was introduced by D. Hebb in the late 1940s. It is, in fact, an application at the level of single neurons of the idea of Pavlov coincidence training. In his famous experiment, Pavlov showed how a dog, which was trained to receive its food when, at the same time, a light was being turned on, would also start to salivate when the light alone was lit. In some way, the coincidence of the two events, food and light, had established a connection in the brain of the dog such that, even when only one of the events occurred, the memory of the other would be stimulated. The basic idea behind the Hebb rule [32] is quite similar: strengthen the connection of neurons that fire together.
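For the perceptron, the Hebb rule is a one-line prescription. A minimal sketch in Python (the teacher vector, the sizes and the √N normalization are illustrative assumptions, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 100, 500                    # input dimension and training-set size

T = rng.standard_normal(N)         # hypothetical teacher couplings
xi = rng.standard_normal((p, N))   # random training inputs
sigma = np.sign(xi @ T)            # teacher's classifications

# Hebb rule: every coupling is strengthened when pre- and post-synaptic
# activity coincide, i.e. J is simply the sum of sigma^mu * xi^mu.
J = (sigma[:, None] * xi).sum(axis=0) / np.sqrt(N)

# The overlap with the teacher measures how well the student generalizes.
overlap = (J @ T) / (np.linalg.norm(J) * np.linalg.norm(T))
print(overlap > 0.5)
```

Each example contributes a term with positive average projection onto the teacher, so the overlap grows with the training-set size and the student classifies new inputs better than chance.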
As a rule teachers are unreliable. From time to time they mix up questions or answer absentmindedly. How much can a student network learn about a target rule if some of the examples in the training set are corrupted by random noise? What is the optimal strategy for the student in this more complicated situation?
To analyse these questions in detail for the two-perceptron scenario is the aim of the present chapter. Let us emphasize that quite generally a certain robustness with respect to random influences is an indispensable requirement for any information processing system, both in biological and in technical contexts. If learning from examples were possible only for perfectly error-free training sets it would be of no practical interest. In fact, since the noise blurring the correct classifications of the teacher may usually be assumed to be independent of the examples, one expects that it will remain possible to infer the rule, probably at the expense of a larger training set.
A general feature of noisy generalization tasks is that the training set is no longer generated by a rule that can be implemented by the student. The problem is said to be unrealizable. A simple example is a training set containing the same input with different outputs, which is quite possible for noisy teachers. This means that for large enough training sets no student exists who is able to reproduce all classifications and the version space becomes empty.
So far we have been considering learning scenarios in which generalization shows up as a gradual process of improvement with the generalization error ε decreasing continuously from its initial pure guessing value ε = 0.5 to the asymptotic limit ε = 0. In the present chapter we study systems which display a quite different behaviour with sudden changes of the generalization ability taking place during the learning process. The reason for this new feature is the presence of discrete degrees of freedom among the parameters, which are adapted during the learning process. As we will see, discontinuous learning is a rather subtle consequence of this discreteness and methods of statistical mechanics are well suited to describe the situation. In particular the abrupt changes which occur in the generalization process can be described as first order phase transitions well studied in statistical physics.
Smooth networks
The learning scenarios discussed so far have been described in the framework of statistical mechanics as a continuous shift of the balance between energetic and entropic terms. In the case of perfect learning the energy describes how difficult it is for the student vector to stay in the version space (see (2.13)). For independent examples it is naturally given as a sum over the training set and scales for large α as E ∼ αε since the generalization error ε gives the probability of error and hence of an additional cost when a new example is presented.
In the present chapter we introduce the basic notions necessary to study learning problems within the framework of statistical mechanics. We also demonstrate the efficiency of learning from examples by the numerical analysis of a very simple situation. Generalizing from this example we will formulate the basic setup of a learning problem in statistical mechanics to be discussed in numerous modifications in later chapters.
Artificial neural networks
The statistical mechanics of learning has been developed primarily for networks of so-called formal neurons. The aim of these networks is to model some of the essential information processing abilities of biological neural networks on the basis of artificial systems with a similar architecture. Formal neurons, the microscopic building blocks of these artificial neural networks, were introduced more than 50 years ago by McCulloch and Pitts as extremely simplified models of the biological neuron [1]. They are bistable linear threshold elements which are either active or passive, to be denoted in the following by a binary variable S = ±1. The state Si of a given neuron i changes with time because of the signals it receives through its synaptic couplings Jij from either the “outside world” or other neurons j.
More precisely, neuron i sums up the incoming activity of all the other neurons weighted by the corresponding synaptic coupling strengths to yield the post-synaptic potential ∑jJij Sj and compares the result with a threshold θi specific to neuron i.
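In code, this update is a single comparison. A minimal sketch (the couplings J and thresholds θ below are arbitrary illustrative values):

```python
import numpy as np

def update_neuron(i, S, J, theta):
    """McCulloch-Pitts dynamics: neuron i becomes active (+1) iff its
    post-synaptic potential sum_j J_ij * S_j reaches its threshold."""
    h = J[i] @ S                   # post-synaptic potential
    return 1 if h >= theta[i] else -1

# Tiny illustrative network (couplings and thresholds are arbitrary).
J = np.array([[0.0, 0.8, -0.3],
              [0.5, 0.0, 0.4],
              [-0.2, 0.6, 0.0]])
theta = np.zeros(3)
S = np.array([1, -1, 1])

print(update_neuron(0, S, J, theta))   # h = 0.8*(-1) - 0.3*1 = -1.1 -> -1
print(update_neuron(1, S, J, theta))   # h = 0.5*1 + 0.4*1 = 0.9 -> +1
```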
There is an important extreme case of learning from a noisy source as discussed in the previous chapter which deserves special consideration. It concerns the situation of an extremely noisy teacher in which the added noise is so strong that it completely dominates the teacher's output. The task for the student is then to reproduce a mapping with no correlations between input and output so that the notion of a teacher actually becomes obsolete. The central question is how many input–output pairs can typically be implemented by an appropriate choice of the couplings J. This is the so-called storage problem. Its investigation yields a measure for the flexibility of the network under consideration with respect to the implementation of different mappings between input and output.
The reason why we include a discussion of this case in the present book, which is mainly devoted to the generalization behaviour of networks, is threefold. Firstly, there is a historical point: in the physics community the storage properties of neural networks were discussed before emphasis was laid on their ability to learn from examples, and several important concepts have been introduced in connection with these earlier investigations. Secondly, in several situations the storage problem is somewhat simpler to analyse and therefore forms a suitable starting point for the more complicated investigation of the generalization performance. Thirdly, we will see in chapter 10 that the flexibility of a network architecture with respect to the implementation of different input–output relations also gives useful information on its generalization behaviour.
Understanding intelligent behaviour has always been fascinating to both laymen and scientists. The question has become very topical through the concurrence of a number of different issues. First, there is a growing awareness of the computational limits of serial computers, while parallel computation is gaining ground, both technically and conceptually. Second, several new non-invasive scanning techniques allow the human brain to be studied from its collective behaviour down to the activity of single neurons. Third, the increased automatization of our society leads to an increased need for algorithms that control complex machines performing complex tasks. Finally, conceptual advances in physics, such as scaling, fractals, bifurcation theory and chaos, have widened its horizon and stimulate the modelling and study of complex non-linear systems. At the crossroads of these developments, artificial neural networks have something to offer to each of them.
The observation that these networks can learn from examples and are able to discern an underlying rule has spurred a decade of intense theoretical activity in the statistical mechanics community on the subject. Indeed, the ability to infer a rule from a set of examples is widely regarded as a sign of intelligence. Without embarking on a thorny discussion about the nature or definition of intelligence, we just note that quite a few of the problems posed in standard IQ tests are exactly of this nature: given a sequence of objects (letters, pictures, …) one is asked to continue the sequence “meaningfully”, which requires one to decipher the underlying rule.
In the preceding chapters we have described various properties of learning in the perceptron, exploiting the fact that its simple architecture allows a rather detailed mathematical analysis. However, the perceptron suffers from a major deficiency that led to its demise in the late 1960s: being able only to implement linearly separable Boolean functions its computational capacities are rather limited. An obvious generalization is feed-forward multilayer networks with one or more intermediate layers of formal neurons between input and output (cf. fig. 1.1c). On the one hand these may be viewed as being composed of individual perceptrons, so that their theoretical analysis may build on what has been accomplished for the perceptron. On the other hand the addition of internal degrees of freedom makes them computationally much more powerful. In fact multilayer neural networks are able to realize all possible Boolean functions between input and output, which makes them an attractive choice for practical applications. There is also a neurophysiological motivation for the study of multilayer networks since most neurons in biological neural nets are interneurons, connected neither directly to sensory inputs nor to motor outputs.
The higher complexity of multilayer networks as compared to the simple perceptron makes the statistical mechanics analysis of their learning abilities more complicated and in general precludes the general and detailed characterization which was possible for the perceptron. Nevertheless, for tailored architectures and suitable learning scenarios very instructive results may be obtained, some of which will be discussed in the present chapter.
The generalization performance of some of the learning rules introduced in the previous chapter could be characterized either by using simple arguments from statistics as in the case of the Hebb rule, or by exploiting our results on Gibbs learning obtained in chapter 2 as in the case of the Bayes rule. Neither of these attempts is successful, however, in determining the generalization error of the remaining learning rules.
In this chapter we will introduce several modifications of the central statistical mechanics method introduced in chapter 2 which will allow us to analyse the generalization behaviour of these remaining rules. The main observation is that all these learning rules can be interpreted as prescriptions to minimize appropriately chosen cost functions. Generalizing the concept of Gibbs learning to non-zero training error will pave the way to studying such minimization problems in a unified fashion.
Before embarking on these general considerations, however, we will discuss in the first section of this chapter how learning rules aiming at maximal stabilities are most conveniently analysed.
The main results of this chapter concerning the generalization error of the various rules are summarized in fig. 4.3 and table 4.1.
Maximal stabilities
A minor extension of the statistical mechanics formalism introduced in chapter 2 is sufficient to analyse the generalization performance of the adatron and the pseudo-inverse rule. The common feature of these two rules is that they search for couplings with maximal stabilities, formalized by the maximization of the stability parameter κ.
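A minover-style iteration along these lines, which repeatedly reinforces the example of currently minimal stability, can be sketched as follows (the sizes, iteration count, teacher construction and Hebbian starting point are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 20, 30

T = rng.standard_normal(N)               # teacher: guarantees separable labels
xi = rng.standard_normal((p, N))
sigma = np.sign(xi @ T)

# Minover-style iteration: find the example with the smallest stability
# kappa^mu = sigma^mu (J.xi^mu)/|J| and reinforce it Hebbian-fashion.
J = (sigma[:, None] * xi).sum(axis=0)    # Hebbian starting point
for _ in range(5000):
    kappa = sigma * (xi @ J) / np.linalg.norm(J)
    mu = int(np.argmin(kappa))
    J += sigma[mu] * xi[mu]

kappa = sigma * (xi @ J) / np.linalg.norm(J)
print(kappa.min() > 0.0)   # every example stored with positive stability
```

For a linearly separable training set such an iteration converges towards the coupling vector of maximal stability.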
So far we have focused on the performance of various learning rules as a function of the size of the training set with examples which are all selected before training starts and remain available during the whole training period. However, in both real life and many practical situations, the training examples come and go with time. Learning then has to proceed on-line, using only the training example which is available at any particular time. This is to be contrasted with the previous scenario, called off-line or batch learning, in which all the training examples are available at all times.
For the Hebb rule, the off-line and on-line scenario coincide: each example provides an additive contribution to the synaptic vector, which is independent of the other examples. We mentioned already in chapter 3 that this rule performs rather badly for large training sets, precisely because it treats all the learning examples in exactly the same way. The purpose of this chapter is to introduce more advanced or alternative on-line learning rules, and to compare their performance with that of their off-line versions.
Stochastic gradient descent
In an on-line scenario, the training examples are presented once and in a sequential order and the coupling vector J is updated at each time step using information from this single example only.
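As a minimal sketch of such a scheme (using a perceptron-type update on errors; the teacher vector, sizes and normalization are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100

T = rng.standard_normal(N)           # hypothetical teacher

# On-line learning: each example arrives once, is used for a single
# update of J, and is then discarded.
J = np.zeros(N)
for t in range(500):
    xi = rng.standard_normal(N)
    sigma = np.sign(T @ xi)          # teacher's label for this input
    if sigma * (J @ xi) <= 0:        # perceptron-type update on error
        J += sigma * xi / np.sqrt(N)

overlap = (J @ T) / (np.linalg.norm(J) * np.linalg.norm(T))
print(overlap > 0)
```

Every update adds a positive contribution σ(T·ξ) = |T·ξ| to the projection of J onto the teacher, so the student's overlap with the teacher grows even though each example is seen only once.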
In this chapter we investigate ICA models in which the number of sources, M, may be less than the number of sensors, N: so-called non-square mixing.
The ‘extra’ sensor observations are explained as observation noise. This general approach may be called Probabilistic Independent Component Analysis (PICA) by analogy with the Probabilistic Principal Component Analysis (PPCA) model of Tipping & Bishop [1997]; ICA and PCA do not include observation noise, whereas PICA and PPCA do.
Non-square ICA models give rise to a likelihood model for the data involving an integral which is intractable. In this chapter we build on previous work in which the integral is estimated using a Laplace approximation. By making the further assumption that the unmixing matrix lies on the decorrelating manifold we are able to make a number of simplifications. Firstly, the observation noise can be estimated using PCA methods, and, secondly, optimisation takes place in a space having a much reduced dimensionality, of order M² parameters rather than M × N. Again, building on previous work, we derive a model order selection criterion for selecting the appropriate number of sources. This is based on the Laplace approximation as applied to the decorrelating manifold. This is then compared with PCA model order selection methods on music and EEG datasets.
Non-Gaussianity is of paramount importance in ICA estimation. Without non-Gaussianity the estimation is not possible at all (unless the independent components have time-dependences). Therefore, it is not surprising that non-Gaussianity could be used as a leading principle in ICA estimation.
In this chapter, we derive a simple principle of ICA estimation: the independent components can be found as the projections that maximize non-Gaussianity. In addition to its intuitive appeal, this approach allows us to derive a highly efficient ICA algorithm, Fast ICA. This is a fixed-point algorithm that can be used for estimating the independent components one by one. At the end of the chapter, it will be seen that it is closely connected to maximum likelihood or infomax estimation as well.
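A one-unit FastICA iteration is compact enough to sketch directly (the mixing matrix, source distributions and convergence tolerance below are illustrative choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 20000

# Two independent non-Gaussian (uniform) unit-variance sources, mixed linearly.
S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, T))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S

# Whitening: centre, then rotate/scale to identity covariance.
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / T)
Z = E @ np.diag(d ** -0.5) @ E.T @ X

# One-unit fixed-point iteration with g = tanh:
#   w <- E{z g(w.z)} - E{g'(w.z)} w,  followed by renormalization.
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(200):
    wz = w @ Z
    g = np.tanh(wz)
    g_prime = 1.0 - g ** 2
    w_new = (Z * g).mean(axis=1) - g_prime.mean() * w
    w_new /= np.linalg.norm(w_new)
    converged = abs(abs(w_new @ w) - 1.0) < 1e-9
    w = w_new
    if converged:
        break

# The projection w.Z recovers one source up to sign and permutation.
y = w @ Z
corr = max(abs(np.corrcoef(y, S[0])[0, 1]),
           abs(np.corrcoef(y, S[1])[0, 1]))
print(corr > 0.95)
```

The extremal points of the non-Gaussianity measure coincide with the source directions, which is why a projection maximizing non-Gaussianity recovers an independent component.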
Whitening
First, let us consider preprocessing techniques that are essential if we want to develop fast ICA methods.
The rather trivial preprocessing that is used in many cases is to centre x, i.e. subtract its mean vector m = E{x} so as to make x a zero-mean variable. This implies that s is zero-mean as well. This preprocessing is made solely to simplify the ICA algorithms: it does not mean that the mean could not be estimated. After estimating the mixing matrix A with centred data, we can complete the estimation by adding the mean vector of s back to the centred estimates of s.
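A minimal sketch of this bookkeeping (the mixing matrix and source means are illustrative assumptions; for simplicity the true A stands in for the estimated one):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 10000

# Sources with nonzero means, mixed by A.
s = rng.laplace(size=(2, T)) + np.array([[2.0], [-1.0]])
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
x = A @ s

# Centring: subtract the mean vector m = E{x}.
m = x.mean(axis=1, keepdims=True)
x_centred = x - m

# The mean of s is recovered as A^{-1} m and added back to the
# centred source estimates, completing the estimation.
s_centred = np.linalg.inv(A) @ x_centred
s_hat = s_centred + np.linalg.inv(A) @ m

print(np.allclose(s_hat, s))
```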
An unsupervised classification algorithm is derived by modelling observed data as a mixture of several mutually exclusive classes that are each described by linear combinations of independent, non-Gaussian densities. The algorithm estimates the density of each class and is able to model class distributions with non-Gaussian structure. It can improve classification accuracy compared with standard Gaussian mixture models. When applied to images, the algorithm can learn efficient codes (basis functions) for images that capture the statistical structure of the images. We applied this method to the problem of unsupervised classification, segmentation and de-noising of images. This method was effective in classifying complex image textures such as trees and rocks in natural scenes. It was also useful for de-noising and filling in missing pixels in images with complex structures. The advantage of this model is that image codes can be learned with increasing numbers of classes thus providing greater flexibility in modelling structure and in finding more image features than in either Gaussian mixture models or standard ICA algorithms.
Introduction
Recently, Blind Source Separation by Independent Component Analysis has been applied to signal processing problems including speech enhancement, telecommunications and medical signal processing. ICA finds a linear non-orthogonal coordinate system in multivariate data determined by second- and higher-order statistics. The goal of ICA is to linearly transform the data in such a way that the transformed variables are as statistically independent from each other as possible [Jutten & Herault, 1991, Comon, 1994, Bell & Sejnowski, 1995, Cardoso & Laheld, 1996, Lee et al., 2000b]. ICA generalizes the technique of Principal Component Analysis (PCA) and, like PCA, has proven a useful tool for finding structure in data.