This enthusiastic introduction to the fundamentals of information theory builds from classical Shannon theory through to modern applications in statistical learning, equipping students with a uniquely well-rounded and rigorous foundation for further study. The book introduces core topics such as data compression, channel coding, and rate-distortion theory using a unique finite-blocklength approach. With over 210 end-of-part exercises and numerous examples, students are introduced to contemporary applications in statistics, machine learning, and modern communication theory. This textbook presents information-theoretic methods with applications in statistical learning and computer science, such as f-divergences, PAC-Bayes bounds and the variational principle, Kolmogorov's metric entropy, strong data-processing inequalities, and entropic upper bounds for statistical estimation. Accompanied by a solutions manual for instructors and additional stand-alone chapters on more specialized topics in information theory, this is the ideal introductory textbook for senior undergraduate and graduate students in electrical engineering, statistics, and computer science.
The main reason data compression is possible is an empirical law: real-world sources produce very restricted sets of sequences. How do we model these restrictions? Chapter 10 looks at the first of three compression types that we will consider: variable-length lossless compression.
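As a classical benchmark quantifying this restriction (stated here for prefix-free codes, a standard special case; the notation is illustrative and not tied to the chapter's own), the optimal expected codeword length for a discrete source X with entropy H(X) (in bits) satisfies
\[ H(X) \;\le\; \min_{\ell\ \text{prefix-free}} E[\ell(X)] \;<\; H(X) + 1, \]
where the lower bound follows from Kraft's inequality \( \sum_x 2^{-\ell(x)} \le 1 \). The more restricted (lower-entropy) the source, the shorter its compressed description.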
In this chapter our goal is to determine the region of achievable exponent pairs for the type-I and type-II error probabilities. Our strategy is to apply the achievability and (strong) converse bounds from Chapter 14 in conjunction with the large-deviations theory developed in Chapter 15. After characterizing the full tradeoff we will discuss an adaptive setting of hypothesis testing where, instead of committing ahead of time to testing on the basis of n samples, one can decide adaptively whether to request more samples or to stop. We will find that adaptivity greatly enlarges the region of achievable error exponents, and we will learn about the sequential probability ratio test (SPRT) of Wald. In the closing sections we will discuss relations to more complicated settings in hypothesis testing: one with composite hypotheses and one with communication constraints.
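As an illustrative statement of Wald's SPRT for two simple hypotheses P and Q (the thresholds a, b > 0 below are generic design parameters, not values fixed by the text): one tracks the log-likelihood ratio
\[ S_n \;=\; \sum_{i=1}^{n} \log \frac{dP}{dQ}(X_i), \]
continues sampling while \( -b < S_n < a \), and at the first exit time declares P if \( S_n \ge a \) and Q if \( S_n \le -b \). The number of samples thus adapts to the data, which is the source of the improved error exponents discussed in this chapter.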
The operation of mapping (naturally occurring) continuous time/analog signals into (electronics-friendly) discrete/digital signals is known as quantization, which is an important subject in signal processing in its own right. In information theory, the study of optimal quantization is called rate-distortion theory, introduced by Shannon in 1959. To start, in Chapter 24 we will take a closer look at quantization, followed by the information-theoretic formulation. A simple (and tight) converse bound is then given, with the matching achievability bound deferred to Chapter 25.
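For orientation, the standard information-theoretic formulation referred to here is the rate-distortion function: for a source \( X \sim P_X \), a reproduction alphabet, and a distortion measure d,
\[ R(D) \;=\; \inf_{P_{\hat X \mid X}\,:\; E[d(X,\hat X)] \le D} I(X; \hat X), \]
which (under suitable conditions) characterizes the minimal number of bits per source symbol needed to guarantee average distortion at most D.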
In Chapter 4 we collect some results on variational characterizations of information measures. It is a well-known method in analysis to study a functional by proving variational characterizations representing it as a supremum or infimum of some other, simpler (often linear) functionals. Such representations can be useful for multiple purposes:
Convexity: the pointwise supremum of convex functions is convex.
Regularity: the pointwise supremum of lower semicontinuous (lsc) functions is lsc.
Bounds: upper and lower bounds on the functional follow by plugging good candidate solutions into the optimization problem.
We will see in this chapter that the divergence has two different sup-characterizations (over partitions and over functions). Mutual information is more special: in addition to inheriting those of the Kullback–Leibler divergence, it possesses two extra ones, an inf-representation over (centroid) measures and a sup-representation over Markov kernels. As applications of these variational characterizations, we discuss the Gibbs variational principle, which serves as the basis of many modern algorithms in machine learning, including the EM algorithm and variational autoencoders; see Section 4.4. An important theoretical construct in machine learning is the idea of PAC-Bayes bounds (Section 4.8*).
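As a concrete instance of the sup-characterization over functions (stated in one standard form, with regularity conditions omitted), the Donsker–Varadhan representation of the divergence reads
\[ D(P\|Q) \;=\; \sup_{f} \Big\{ E_P[f(X)] - \log E_Q\big[e^{f(X)}\big] \Big\}, \]
with the supremum over, say, bounded measurable f. Rearranged, the same identity gives the Gibbs variational principle, \( \log E_Q[e^{f}] = \sup_P \{ E_P[f] - D(P\|Q) \} \), whose maximizer is the tilted (Gibbs) distribution \( dP^* \propto e^{f}\, dQ \).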
So far we have been focusing on the paradigm of one-way communication: data are mapped to codewords and transmitted, and later decoded based on the received noisy observations. Chapter 23 looks at the more practical setting (except for storage), where communication frequently goes both ways, so that the receiver can provide feedback to the transmitter. As a motivating example, consider the downlink transmission from a satellite to Earth. Downlink transmission is very expensive (there is a power constraint at the satellite), but the uplink from Earth to the satellite is cheap, which makes virtually noiseless feedback readily available at the transmitter (the satellite). In general, a channel with noiseless feedback is interesting when such an asymmetry exists between uplink and downlink. Even in less ideal settings, noisy or partial feedback is commonly available and can potentially improve the reliability or complexity of communication. In the first half of our discussion, we shall follow Shannon to show that even with noiseless feedback “nothing” can be gained in the conventional setup. In the process, we will also introduce Massey’s concept of directed information. In the second half of the chapter we examine situations where feedback is extremely helpful: low probability of error, variable transmission length, and variable transmission power.
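To state the headline result loosely (for a stationary memoryless channel; this is the classical formulation, quoted here as background rather than as a new claim): even with perfect noiseless feedback the capacity is unchanged,
\[ C_{\mathrm{fb}} \;=\; C \;=\; \max_{P_X} I(X;Y), \]
so feedback cannot increase the maximal rate of reliable communication, although, as discussed later in the chapter, it can dramatically improve error probability and latency.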
In Chapter 21 we will consider an interesting variation of the channel coding problem. Instead of constraining the blocklength (i.e., the number of channel uses), we will constrain the total cost incurred by the codewords. The motivation is the following. Consider a deep-space probe that has a k-bit message to deliver to Earth (or to a satellite orbiting it). The duration of the transmission is of little concern for the probe; what is really limited is the amount of energy stored in its battery. In this chapter we will learn how to study this question abstractly and how the resulting fundamental limit is related to communication over continuous-time channels.
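A classical instance of this fundamental limit (for the AWGN channel with noise power spectral density N_0/2; quoted here as a standard benchmark) is the minimum energy per information bit,
\[ \frac{E_b}{N_0} \;\ge\; \ln 2 \;\approx\; -1.59\ \mathrm{dB}, \]
approached in the limit of very long, low-rate transmissions, which is exactly the regime natural for the deep-space probe above.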
In Chapter 25 we present the hard direction of the rate-distortion theorem: the random coding construction of a quantizer. This method is then extended to develop a covering lemma and a soft-covering lemma, which lead to the sharp result of Cuff showing that the fundamental limit of channel simulation is given by Wyner’s common information. We also derive (a strengthened form of) Han and Verdú’s results on approximating output distributions in Kullback–Leibler divergence.
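For reference, Wyner’s common information mentioned here is, in its standard form,
\[ C(X;Y) \;=\; \inf_{P_{W \mid X,Y}\,:\; X \to W \to Y} I(X,Y; W), \]
the least rate of a common random variable W that renders X and Y conditionally independent.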
The topic of this chapter is the deterministic (worst-case) theory of quantization. The main object of interest is the metric entropy of a set, which allows us to answer two key questions (both are made precise below):
(1) covering number: the minimum number of points to cover a set up to a given accuracy;
(2) packing number: the maximal number of elements of a given set with a prescribed minimum pairwise distance.
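In a standard formulation (for a subset Θ of a metric space (X, d) and accuracy ε > 0): the covering number \( N(\Theta, d, \varepsilon) \) is the smallest number of ε-balls whose union contains Θ; the packing number \( M(\Theta, d, \varepsilon) \) is the largest number of points of Θ with pairwise distances exceeding ε; and the metric entropy is \( \log N(\Theta, d, \varepsilon) \). The two counts are linked by the elementary sandwich
\[ M(\Theta, d, 2\varepsilon) \;\le\; N(\Theta, d, \varepsilon) \;\le\; M(\Theta, d, \varepsilon). \]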
The foundational theory of metric entropy was put forth by Kolmogorov, who, together with his students, also determined the behavior of metric entropy in a variety of problems for both finite and infinite dimensions. Kolmogorov’s original interest in this subject stems from Hilbert’s thirteenth problem, which concerns the possibility or impossibility of representing multivariable functions as compositions of functions of fewer variables. Metric entropy has found numerous connections to and applications in other fields, such as approximation theory, empirical processes, small-ball probability, mathematical statistics, and machine learning.
In Chapter 17 we introduce the concept of an error-correcting code (ECC). We will spend time discussing what it means for a code to have a low probability of error and what the optimal (ML or MAP) decoder is. For the special case of coding over the binary symmetric channel (BSC), we showcase the evolution of our understanding of the fundamental limits, from pre-Shannon ideas to the modern finite-blocklength view. We also briefly review the history of ECCs. We conclude with a conceptually important proof of a weak converse (impossibility) bound on the performance of ECCs.
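To preview the flavor of the weak converse (a standard Fano-type sketch in generic notation, not the chapter’s exact statement): for a code with M messages, blocklength n and error probability ε over a memoryless channel of capacity C used without feedback, Fano’s inequality gives
\[ \log M \;\le\; \frac{nC + h_b(\varepsilon)}{1-\varepsilon}, \]
where \( h_b \) is the binary entropy function; hence at rates above capacity the error probability must stay bounded away from zero.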
Chapter 33 introduces strong data-processing inequalities (SDPIs), which are quantitative strengthenings of the DPIs in Part I. As applications, we show how to apply SDPIs to deduce lower bounds for various estimation problems on graphs or in distributed settings. The purpose of this chapter is twofold. First, we want to introduce general properties of the SDPI coefficients. Second, we want to show how SDPIs help prove sharp lower (impossibility) bounds on statistical estimation questions. The flavor of the statistical problems in this chapter is different from the rest of the book in that here the information about the unknown parameter θ is either more “thinly spread” across a high-dimensional vector of observations than in classical X = θ + Z models (see the spiked Wigner and tree-coloring examples), or distributed across different terminals (as in the correlation and mean estimation examples).
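For concreteness, the basic object (in one standard formulation) is the contraction coefficient of a Markov kernel \( P_{Y|X} \),
\[ \eta_{\mathrm{KL}}(P_{Y|X}) \;=\; \sup_{P \ne Q} \frac{D(P_{Y|X} \circ P \,\|\, P_{Y|X} \circ Q)}{D(P\|Q)} \;\le\; 1, \]
where \( P_{Y|X} \circ P \) denotes the output distribution when the input is distributed as P. An SDPI is the resulting quantitative bound \( D(P_{Y|X} \circ P \,\|\, P_{Y|X} \circ Q) \le \eta_{\mathrm{KL}}(P_{Y|X})\, D(P\|Q) \), which strictly improves on the ordinary DPI whenever \( \eta_{\mathrm{KL}} < 1 \).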
So far our discussion on information-theoretic methods has been mostly focused on statistical lower bounds (impossibility results), with matching upper bounds obtained on a case-by-case basis. In Chapter 32 we will discuss three information-theoretic upper bounds for statistical estimation under KL divergence (Yang–Barron), Hellinger (Le Cam–Birgé), and total variation (Yatracos) loss metrics. These three results apply to different loss functions and are obtained using completely different means. However, they take on exactly the same form, involving the appropriate metric entropy of the model. In particular, we will see that these methods achieve minimax optimal rates for the classical problem of density estimation under smoothness constraints.
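Schematically (suppressing constants and regularity conditions, and using generic notation rather than the chapter’s own), all three bounds follow the same template: if N(ε) denotes an appropriate covering number (metric entropy) of the model class, then choosing \( \varepsilon_n \) to balance
\[ \log N(\varepsilon_n) \;\asymp\; n\, \varepsilon_n^2 \]
yields an estimation risk on the order of \( \varepsilon_n^2 \) in the corresponding loss, which in many classical nonparametric models matches the minimax rate.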
Chapter 29 gives an exposition of the classical large-sample asymptotics for smooth parametric models in fixed dimensions, highlighting the role of Fisher information introduced in Chapter 2. Notably, we discuss how to deduce classical lower bounds (Hammersley–Chapman–Robbins, Cramér–Rao, van Trees) from the variational characterization and the data-processing inequality (DPI) of the χ²-divergence in Chapter 7.
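As a reminder of the shape of these bounds (stated in a standard scalar form with regularity conditions omitted): for an unbiased estimator \( \hat\theta \) of θ, the Hammersley–Chapman–Robbins bound obtained from the χ²-divergence reads
\[ \operatorname{Var}_\theta(\hat\theta) \;\ge\; \sup_{\theta' \ne \theta} \frac{(\theta'-\theta)^2}{\chi^2(P_{\theta'} \,\|\, P_\theta)}, \]
and letting \( \theta' \to \theta \), where \( \chi^2(P_{\theta'}\|P_\theta) \approx (\theta'-\theta)^2 J_F(\theta) \), recovers the Cramér–Rao bound \( \operatorname{Var}_\theta(\hat\theta) \ge 1/J_F(\theta) \), with \( J_F \) the Fisher information.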