For a fixed set of input examples, one can decompose the N-sphere into cells each consisting of all the perceptron coupling vectors J giving rise to the same classification of those examples. Several aspects of perceptron learning discussed in the preceding chapters are related to the geometric properties of this decomposition, which turns out to have random multifractal properties. Our outline of the mathematical techniques related to the multifractal method will of course be short and ad rem; see [172, 173] for a more detailed introduction. But this alternative description provides a deeper and unified view of the different learning properties of the perceptron. It highlights some of the more subtle aspects of the thermodynamic limit and its role in the statistical mechanics analysis of perceptron learning. In this way we finish our discussion of the perceptron with an encompassing multifractal description, preparing the way for the application of this approach to the analysis of multilayer networks.
The shattered coupling space
Consider a set of p = αN examples ξµ generated independently at random from the uniform distribution on the N-sphere. Each hyperplane perpendicular to one of these inputs cuts the coupling space of a spherical perceptron, which is the very same N-sphere, into two half-spheres according to the two possible classifications of the example.
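This decomposition can be made concrete numerically. The following sketch (sizes N, p and the sampling count are arbitrary illustrative choices) samples coupling vectors J at random and records which classification of the examples each one induces; coupling vectors producing the same label tuple lie in the same cell:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 10, 4                       # sphere dimension and number of examples

xi = rng.standard_normal((p, N))   # random example directions
xi /= np.linalg.norm(xi, axis=1, keepdims=True)

# Each coupling vector J induces a classification tuple
# (sign(J.xi^1), ..., sign(J.xi^p)); vectors with the same tuple
# lie in the same cell of the decomposition of the sphere.
cells = set()
for _ in range(100000):
    J = rng.standard_normal(N)
    cells.add(tuple(np.sign(xi @ J).astype(int)))

print(len(cells))                  # at most 2**p distinct cells
```

Since p ≤ N here and the examples are in general position, all 2^p classifications are realizable, so the number of distinct cells found approaches 2^p as more coupling vectors are sampled.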
In this book we have discussed how various aspects of learning in artificial neural networks may be quantified by using concepts and techniques developed in the statistical mechanics of disordered systems. These methods grew out of the desire to understand some strange low-temperature properties of disordered magnets; nevertheless their usefulness for and efficiency in the analysis of a completely different class of complex systems underlines the generality and strength of the principles of statistical mechanics.
In this final chapter we have collected some additional examples of non-physical complex systems for which an analysis using methods of statistical mechanics similar to those employed for the study of neural networks has given rise to new and interesting results. Compared with the previous chapters, the discussions in the present one will be somewhat more superficial – merely pointing to the qualitative analogies with the problems elucidated previously, rather than working out the consequences in full detail. Moreover, some of the problems we consider are strongly linked to information processing and artificial neural networks, whereas others are not. In all cases quenched random variables are used to represent complicated interactions which are not known in detail, and the typical behaviour in a properly defined thermodynamic limit is of particular interest.
Support vector machines
The main reason which prevents the perceptron from being a serious candidate for the solution of many real-world learning problems is that it can only implement linearly separable Boolean functions.
The Gibbs rule discussed in the previous chapter characterizes the typical generalization behaviour of the students forming the version space. It is hence well suited for a general theoretical analysis. For a concrete practical problem it is, however, hardly the best choice and there is a variety of other learning rules which are often more direct and may also show a better performance. The purpose of this chapter is to introduce a representative selection of these learning rules, to discuss some of their features, and to compare their properties with those of the Gibbs rule.
The Hebb rule
The oldest and maybe most important learning rule was introduced by D. Hebb in the late 1940s. It is, in fact, an application at the level of single neurons of the idea of Pavlov coincidence training. In his famous experiment, Pavlov showed how a dog, which was trained to receive its food when, at the same time, a light was being turned on, would also start to salivate when the light alone was lit. In some way, the coincidence of the two events, food and light, had established a connection in the brain of the dog such that, even when only one of the events occurred, the memory of the other would be stimulated. The basic idea behind the Hebb rule [32] is quite similar: strengthen the connection of neurons that fire together.
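For the perceptron, the Hebb rule is a one-line prescription. A minimal sketch in Python (the teacher vector, the sizes and the √N normalization are illustrative assumptions, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 100, 500                    # input dimension and training-set size

T = rng.standard_normal(N)         # hypothetical teacher couplings
xi = rng.standard_normal((p, N))   # random training inputs
sigma = np.sign(xi @ T)            # teacher's classifications

# Hebb rule: every coupling is strengthened when pre- and post-synaptic
# activity coincide, i.e. J is simply the sum of sigma^mu * xi^mu.
J = (sigma[:, None] * xi).sum(axis=0) / np.sqrt(N)

# The overlap with the teacher measures how well the student generalizes.
overlap = (J @ T) / (np.linalg.norm(J) * np.linalg.norm(T))
print(overlap > 0.5)
```

Each example contributes a term with positive average projection onto the teacher, so the overlap grows with the training-set size and the student classifies new inputs better than chance.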
As a rule teachers are unreliable. From time to time they mix up questions or answer absentmindedly. How much can a student network learn about a target rule if some of the examples in the training set are corrupted by random noise? What is the optimal strategy for the student in this more complicated situation?
To analyse these questions in detail for the two-perceptron scenario is the aim of the present chapter. Let us emphasize that quite generally a certain robustness with respect to random influences is an indispensable requirement for any information processing system, both in biological and in technical contexts. If learning from examples were possible only for perfectly error-free training sets it would be of no practical interest. In fact, since the noise blurring the correct classifications of the teacher may usually be assumed to be independent of the examples, one expects that it will remain possible to infer the rule, probably at the expense of a larger training set.
A general feature of noisy generalization tasks is that the training set is no longer generated by a rule that can be implemented by the student. The problem is said to be unrealizable. A simple example is a training set containing the same input with different outputs, which is quite possible for noisy teachers. This means that for large enough training sets no student exists who is able to reproduce all classifications and the version space becomes empty.
So far we have been considering learning scenarios in which generalization shows up as a gradual process of improvement with the generalization error ε decreasing continuously from its initial pure guessing value ε = 0.5 to the asymptotic limit ε = 0. In the present chapter we study systems which display a quite different behaviour with sudden changes of the generalization ability taking place during the learning process. The reason for this new feature is the presence of discrete degrees of freedom among the parameters, which are adapted during the learning process. As we will see, discontinuous learning is a rather subtle consequence of this discreteness and methods of statistical mechanics are well suited to describe the situation. In particular the abrupt changes which occur in the generalization process can be described as first order phase transitions well studied in statistical physics.
Smooth networks
The learning scenarios discussed so far have been described in the framework of statistical mechanics as a continuous shift of the balance between energetic and entropic terms. In the case of perfect learning the energy describes how difficult it is for the student vector to stay in the version space (see (2.13)). For independent examples it is naturally given as a sum over the training set and scales for large α as E ∼ αε since the generalization error ε gives the probability of error and hence of an additional cost when a new example is presented.
In the present chapter we introduce the basic notions necessary to study learning problems within the framework of statistical mechanics. We also demonstrate the efficiency of learning from examples by the numerical analysis of a very simple situation. Generalizing from this example we will formulate the basic setup of a learning problem in statistical mechanics to be discussed in numerous modifications in later chapters.
Artificial neural networks
The statistical mechanics of learning has been developed primarily for networks of so-called formal neurons. The aim of these networks is to model some of the essential information processing abilities of biological neural networks on the basis of artificial systems with a similar architecture. Formal neurons, the microscopic building blocks of these artificial neural networks, were introduced more than 50 years ago by McCulloch and Pitts as extremely simplified models of the biological neuron [1]. They are bistable linear threshold elements which are either active or passive, to be denoted in the following by a binary variable S = ±1. The state Si of a given neuron i changes with time because of the signals it receives through its synaptic couplings Jij from either the “outside world” or other neurons j.
More precisely, neuron i sums up the incoming activity of all the other neurons weighted by the corresponding synaptic coupling strengths to yield the post-synaptic potential ∑jJij Sj and compares the result with a threshold θi specific to neuron i.
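In code, this update is a single comparison. A minimal sketch (the couplings J and thresholds θ below are arbitrary illustrative values):

```python
import numpy as np

def update_neuron(i, S, J, theta):
    """McCulloch-Pitts dynamics: neuron i becomes active (+1) iff its
    post-synaptic potential sum_j J_ij * S_j reaches its threshold."""
    h = J[i] @ S                   # post-synaptic potential
    return 1 if h >= theta[i] else -1

# Tiny illustrative network (couplings and thresholds are arbitrary).
J = np.array([[0.0, 0.8, -0.3],
              [0.5, 0.0, 0.4],
              [-0.2, 0.6, 0.0]])
theta = np.zeros(3)
S = np.array([1, -1, 1])

print(update_neuron(0, S, J, theta))   # h = 0.8*(-1) - 0.3*1 = -1.1 -> -1
print(update_neuron(1, S, J, theta))   # h = 0.5*1 + 0.4*1 = 0.9 -> +1
```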
There is an important extreme case of learning from a noisy source as discussed in the previous chapter which deserves special consideration. It concerns the situation of an extremely noisy teacher in which the added noise is so strong that it completely dominates the teacher's output. The task for the student is then to reproduce a mapping with no correlations between input and output so that the notion of a teacher actually becomes obsolete. The central question is how many input–output pairs can typically be implemented by an appropriate choice of the couplings J. This is the so-called storage problem. Its investigation yields a measure for the flexibility of the network under consideration with respect to the implementation of different mappings between input and output.
The reason why we include a discussion of this case in the present book, which is mainly devoted to the generalization behaviour of networks, is threefold. Firstly, there is a historical point: in the physics community the storage properties of neural networks were discussed before emphasis was laid on their ability to learn from examples, and several important concepts have been introduced in connection with these earlier investigations. Secondly, in several situations the storage problem is somewhat simpler to analyse and therefore forms a suitable starting point for the more complicated investigation of the generalization performance. Thirdly, we will see in chapter 10 that the flexibility of a network architecture with respect to the implementation of different input–output relations also gives useful information on its generalization behaviour.
Understanding intelligent behaviour has always been fascinating to both laymen and scientists. The question has become very topical through the concurrence of a number of different issues. First, there is a growing awareness of the computational limits of serial computers, while parallel computation is gaining ground, both technically and conceptually. Second, several new non-invasive scanning techniques allow the human brain to be studied from its collective behaviour down to the activity of single neurons. Third, the increased automatization of our society leads to an increased need for algorithms that control complex machines performing complex tasks. Finally, conceptual advances in physics, such as scaling, fractals, bifurcation theory and chaos, have widened its horizon and stimulate the modelling and study of complex non-linear systems. At the crossroads of these developments, artificial neural networks have something to offer to each of them.
The observation that these networks can learn from examples and are able to discern an underlying rule has spurred a decade of intense theoretical activity in the statistical mechanics community on the subject. Indeed, the ability to infer a rule from a set of examples is widely regarded as a sign of intelligence. Without embarking on a thorny discussion about the nature or definition of intelligence, we just note that quite a few of the problems posed in standard IQ tests are exactly of this nature: given a sequence of objects (letters, pictures, …) one is asked to continue the sequence “meaningfully”, which requires one to decipher the underlying rule.
In the preceding chapters we have described various properties of learning in the perceptron, exploiting the fact that its simple architecture allows a rather detailed mathematical analysis. However, the perceptron suffers from a major deficiency that led to its demise in the late 1960s: being able only to implement linearly separable Boolean functions its computational capacities are rather limited. An obvious generalization is feed-forward multilayer networks with one or more intermediate layers of formal neurons between input and output (cf. fig. 1.1c). On the one hand these may be viewed as being composed of individual perceptrons, so that their theoretical analysis may build on what has been accomplished for the perceptron. On the other hand the addition of internal degrees of freedom makes them computationally much more powerful. In fact multilayer neural networks are able to realize all possible Boolean functions between input and output, which makes them an attractive choice for practical applications. There is also a neurophysiological motivation for the study of multilayer networks since most neurons in biological neural nets are interneurons, connected neither directly to sensory inputs nor to motor outputs.
The higher complexity of multilayer networks as compared to the simple perceptron makes the statistical mechanics analysis of their learning abilities more complicated and in general precludes the general and detailed characterization which was possible for the perceptron. Nevertheless, for tailored architectures and suitable learning scenarios very instructive results may be obtained, some of which will be discussed in the present chapter.
The generalization performance of some of the learning rules introduced in the previous chapter could be characterized either by using simple arguments from statistics as in the case of the Hebb rule, or by exploiting our results on Gibbs learning obtained in chapter 2 as in the case of the Bayes rule. Neither of these attempts is successful, however, in determining the generalization error of the remaining learning rules.
In this chapter we will introduce several modifications of the central statistical mechanics method introduced in chapter 2 which will allow us to analyse the generalization behaviour of these remaining rules. The main observation is that all these learning rules can be interpreted as prescriptions to minimize appropriately chosen cost functions. Generalizing the concept of Gibbs learning to non-zero training error will pave the way to studying such minimization problems in a unified fashion.
Before embarking on these general considerations, however, we will discuss in the first section of this chapter how learning rules aiming at maximal stabilities are most conveniently analysed.
The main results of this chapter concerning the generalization error of the various rules are summarized in fig. 4.3 and table 4.1.
Maximal stabilities
A minor extension of the statistical mechanics formalism introduced in chapter 2 is sufficient to analyse the generalization performance of the adatron and the pseudo-inverse rule. The common feature of these two rules is that they search for couplings with maximal stabilities, formalized by the maximization of the stability parameter κ.
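A minover-style iteration along these lines, which repeatedly reinforces the example of currently minimal stability, can be sketched as follows (the sizes, iteration count, teacher construction and Hebbian starting point are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 20, 30

T = rng.standard_normal(N)               # teacher: guarantees separable labels
xi = rng.standard_normal((p, N))
sigma = np.sign(xi @ T)

# Minover-style iteration: find the example with the smallest stability
# kappa^mu = sigma^mu (J.xi^mu)/|J| and reinforce it Hebbian-fashion.
J = (sigma[:, None] * xi).sum(axis=0)    # Hebbian starting point
for _ in range(5000):
    kappa = sigma * (xi @ J) / np.linalg.norm(J)
    mu = int(np.argmin(kappa))
    J += sigma[mu] * xi[mu]

kappa = sigma * (xi @ J) / np.linalg.norm(J)
print(kappa.min() > 0.0)   # every example stored with positive stability
```

For a linearly separable training set such an iteration converges towards the coupling vector of maximal stability.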
So far we have focused on the performance of various learning rules as a function of the size of the training set with examples which are all selected before training starts and remain available during the whole training period. However, in both real life and many practical situations, the training examples come and go with time. Learning then has to proceed on-line, using only the training example which is available at any particular time. This is to be contrasted with the previous scenario, called off-line or batch learning, in which all the training examples are available at all times.
For the Hebb rule, the off-line and on-line scenario coincide: each example provides an additive contribution to the synaptic vector, which is independent of the other examples. We mentioned already in chapter 3 that this rule performs rather badly for large training sets, precisely because it treats all the learning examples in exactly the same way. The purpose of this chapter is to introduce more advanced or alternative on-line learning rules, and to compare their performance with that of their off-line versions.
Stochastic gradient descent
In an on-line scenario, the training examples are presented once and in a sequential order and the coupling vector J is updated at each time step using information from this single example only.
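As a minimal sketch of such a scheme (using a perceptron-type update on errors; the teacher vector, sizes and normalization are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100

T = rng.standard_normal(N)           # hypothetical teacher

# On-line learning: each example arrives once, is used for a single
# update of J, and is then discarded.
J = np.zeros(N)
for t in range(500):
    xi = rng.standard_normal(N)
    sigma = np.sign(T @ xi)          # teacher's label for this input
    if sigma * (J @ xi) <= 0:        # perceptron-type update on error
        J += sigma * xi / np.sqrt(N)

overlap = (J @ T) / (np.linalg.norm(J) * np.linalg.norm(T))
print(overlap > 0)
```

Every update adds a positive contribution σ(T·ξ) = |T·ξ| to the projection of J onto the teacher, so the student's overlap with the teacher grows even though each example is seen only once.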
In this chapter we investigate ICA models in which the number of sources, M, may be less than the number of sensors, N: so-called non-square mixing.
The ‘extra’ sensor observations are explained as observation noise. This general approach may be called Probabilistic Independent Component Analysis (PICA) by analogy with the Probabilistic Principal Component Analysis (PPCA) model of Tipping & Bishop [1997]; ICA and PCA do not include observation noise, whereas PICA and PPCA do.
Non-square ICA models give rise to a likelihood model for the data involving an integral which is intractable. In this chapter we build on previous work in which the integral is estimated using a Laplace approximation. By making the further assumption that the unmixing matrix lies on the decorrelating manifold we are able to make a number of simplifications. Firstly, the observation noise can be estimated using PCA methods, and, secondly, optimisation takes place in a space having a much reduced dimensionality, of order M² parameters rather than M × N. Again, building on previous work, we derive a model order selection criterion for selecting the appropriate number of sources. This is based on the Laplace approximation as applied to the decorrelating manifold. This is then compared with PCA model order selection methods on music and EEG datasets.
Non-Gaussianity is of paramount importance in ICA estimation. Without non-Gaussianity the estimation is not possible at all (unless the independent components have time-dependences). Therefore, it is not surprising that non-Gaussianity could be used as a leading principle in ICA estimation.
In this chapter, we derive a simple principle of ICA estimation: the independent components can be found as the projections that maximize non-Gaussianity. In addition to its intuitive appeal, this approach allows us to derive a highly efficient ICA algorithm, Fast ICA. This is a fixed-point algorithm that can be used for estimating the independent components one by one. At the end of the chapter, it will be seen that it is closely connected to maximum likelihood or infomax estimation as well.
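A one-unit FastICA iteration is compact enough to sketch directly (the mixing matrix, source distributions and convergence tolerance below are illustrative choices, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 20000

# Two independent non-Gaussian (uniform) unit-variance sources, mixed linearly.
S = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, T))
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S

# Whitening: centre, then rotate/scale to identity covariance.
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / T)
Z = E @ np.diag(d ** -0.5) @ E.T @ X

# One-unit fixed-point iteration with g = tanh:
#   w <- E{z g(w.z)} - E{g'(w.z)} w,  followed by renormalization.
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(200):
    wz = w @ Z
    g = np.tanh(wz)
    g_prime = 1.0 - g ** 2
    w_new = (Z * g).mean(axis=1) - g_prime.mean() * w
    w_new /= np.linalg.norm(w_new)
    converged = abs(abs(w_new @ w) - 1.0) < 1e-9
    w = w_new
    if converged:
        break

# The projection w.Z recovers one source up to sign and permutation.
y = w @ Z
corr = max(abs(np.corrcoef(y, S[0])[0, 1]),
           abs(np.corrcoef(y, S[1])[0, 1]))
print(corr > 0.95)
```

The extremal points of the non-Gaussianity measure coincide with the source directions, which is why a projection maximizing non-Gaussianity recovers an independent component.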
Whitening
First, let us consider preprocessing techniques that are essential if we want to develop fast ICA methods.
The rather trivial preprocessing that is used in many cases is to centre x, i.e. subtract its mean vector m = E{x} so as to make x a zero-mean variable. This implies that s is zero-mean as well. This preprocessing is made solely to simplify the ICA algorithms: it does not mean that the mean could not be estimated. After estimating the mixing matrix A with centred data, we can complete the estimation by adding the mean vector of s back to the centred estimates of s.
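A minimal sketch of this bookkeeping (the mixing matrix and source means are illustrative assumptions; for simplicity the true A stands in for the estimated one):

```python
import numpy as np

rng = np.random.default_rng(5)
T = 10000

# Sources with nonzero means, mixed by A.
s = rng.laplace(size=(2, T)) + np.array([[2.0], [-1.0]])
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
x = A @ s

# Centring: subtract the mean vector m = E{x}.
m = x.mean(axis=1, keepdims=True)
x_centred = x - m

# The mean of s is recovered as A^{-1} m and added back to the
# centred source estimates, completing the estimation.
s_centred = np.linalg.inv(A) @ x_centred
s_hat = s_centred + np.linalg.inv(A) @ m

print(np.allclose(s_hat, s))
```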
An unsupervised classification algorithm is derived by modelling observed data as a mixture of several mutually exclusive classes that are each described by linear combinations of independent, non-Gaussian densities. The algorithm estimates the density of each class and is able to model class distributions with non-Gaussian structure. It can improve classification accuracy compared with standard Gaussian mixture models. When applied to images, the algorithm can learn efficient codes (basis functions) for images that capture the statistical structure of the images. We applied this method to the problem of unsupervised classification, segmentation and de-noising of images. This method was effective in classifying complex image textures such as trees and rocks in natural scenes. It was also useful for de-noising and filling in missing pixels in images with complex structures. The advantage of this model is that image codes can be learned with increasing numbers of classes thus providing greater flexibility in modelling structure and in finding more image features than in either Gaussian mixture models or standard ICA algorithms.
Introduction
Recently, Blind Source Separation by Independent Component Analysis has been applied to signal processing problems including speech enhancement, telecommunications and medical signal processing. ICA finds a linear non-orthogonal coordinate system in multivariate data determined by second- and higher-order statistics. The goal of ICA is to linearly transform the data in such a way that the transformed variables are as statistically independent from each other as possible [Jutten & Herault, 1991, Comon, 1994, Bell & Sejnowski, 1995, Cardoso & Laheld, 1996, Lee et al., 2000b]. ICA generalizes the technique of Principal Component Analysis (PCA) and, like PCA, has proven a useful tool for finding structure in data.