Chapter 6 addresses the problem of error estimation and resampling from both theoretical and practical perspectives. The holdout method is reviewed and cast into the bias/variance framework. Simple resampling approaches such as cross-validation are also reviewed, and important variations such as stratified cross-validation and leave-one-out are introduced. Multiple resampling approaches such as bootstrapping, randomization, and multiple trials of simple resampling are then introduced and discussed.
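The k-fold splitting idea at the heart of cross-validation can be sketched in a few lines of plain Python (an illustrative sketch; the function name and interface are mine, not the book's):

```python
# Minimal k-fold index generator: each example lands in exactly one
# test fold, and the remaining folds form the training set.
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold CV."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
```

In practice one would shuffle (and, for stratified cross-validation, balance class proportions across folds) before splitting; scikit-learn's cross-validation utilities handle both cases.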
Chapter 2 reviews the principles of statistics that are necessary for the discussion of machine learning evaluation methods, especially the statistical analysis discussion of Chapter 7. In particular, it reviews the notions of random variables, distributions, confidence intervals, and hypothesis testing.
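As a small illustration of one of these tools (a sketch of my own, not taken from the book), a normal-approximation 95% confidence interval for a sample mean, such as a mean accuracy over repeated runs, can be computed as:

```python
import math

# Normal-approximation confidence interval for a sample mean,
# using the sample standard deviation (n - 1 denominator).
def mean_confidence_interval(sample, z=1.96):
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    half_width = z * math.sqrt(var / n)
    return mean - half_width, mean + half_width

low, high = mean_confidence_interval([0.82, 0.79, 0.85, 0.81, 0.83])
```

For small samples, a t-distribution critical value would replace the fixed z = 1.96.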
In Chapter 10, the book turns to practical considerations. In particular, it surveys the software engineering discipline with its rigorous software testing methods, and asks how these techniques can be adapted to the subfield of machine learning. The adaptation is not straightforward, as machine learning algorithms behave in non-deterministic ways aggravated by data, algorithm, and platform imperfections. These issues are discussed and some of the steps taken to handle them are reviewed. The chapter then turns to the practice of online testing and addresses the ethics of machine learning deployment. The chapter concludes with a discussion of current industry practice along with suggestions on how to improve the safety of industrial deployment in the future.
Chapter 5 starts with an analysis of the classification metrics presented in Chapter 4, outlining their strengths and weaknesses. It then presents more advanced metrics such as Cohen’s kappa, Youden’s index, and likelihood ratios. This is followed by a discussion about data and classifier complexities such as the class imbalance problem and classifier uncertainty that require particular scrutiny to ensure that the results are trustworthy. The chapter concludes with a detailed discussion of ROC analysis to complement its introduction in Chapter 4, and a presentation of other visualization metrics.
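To give a flavor of one such advanced metric, Cohen's kappa corrects observed agreement for the agreement expected by chance alone. The sketch below (my own illustration, assuming a binary confusion matrix) shows the computation:

```python
# Cohen's kappa from binary confusion-matrix counts.
def cohens_kappa(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement from the marginal label frequencies.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_expected = p_yes + p_no
    return (p_observed - p_expected) / (1 - p_expected)
```

Perfect agreement yields kappa = 1, while a classifier that matches the labels only at chance level yields kappa = 0, which is what makes the metric informative under class imbalance.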
Chapter 3 discusses the field of machine learning from a theoretical perspective. This review lays the groundwork for the discussion of advanced metrics in Chapter 5 and of error estimation methods in Chapter 6. The specific concepts surveyed in this chapter include loss functions, empirical risk, generalization error, empirical and structural risk minimization, regularization, and learning bias. The unsupervised learning paradigm is also reviewed, and the chapter concludes with a discussion of the bias/variance tradeoff.
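As a concrete anchor for these concepts (an illustrative sketch of mine, not the book's code), the empirical risk under 0-1 loss is simply the fraction of training examples a hypothesis misclassifies, the quantity that empirical risk minimization seeks to minimize:

```python
# Empirical risk under 0-1 loss: the average misclassification rate
# of a set of predictions against the true labels.
def empirical_risk(predictions, labels):
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)
```

Generalization error is the analogous expectation over the true data distribution, which is unobservable and must be estimated by the methods of Chapter 6.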
Chapter 9 is devoted to evaluation methods for an important category of classical learning paradigms left out of Chapter 8 so as to receive fuller coverage: unsupervised learning. In this chapter, a number of different unsupervised learning schemes are considered and their evaluation discussed. The particular tasks considered are clustering and hierarchical clustering, dimensionality reduction, latent variable modeling, and generative models including probabilistic PCA, variational autoencoders, and GANs. Evaluation methodology is discussed for each of these tasks.
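One simple external evaluation measure for clustering, when ground-truth labels happen to be available, is cluster purity: each cluster is credited with its most frequent true label. This sketch is my own illustration, not the book's formulation:

```python
from collections import Counter

# Cluster purity: the fraction of points whose true label matches
# the majority label of their assigned cluster.
def purity(cluster_assignments, true_labels):
    clusters = {}
    for c, y in zip(cluster_assignments, true_labels):
        clusters.setdefault(c, []).append(y)
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(true_labels)
```

Purity is easy to interpret but rewards over-fragmentation (singleton clusters score 1.0), which is why chance-corrected measures are preferred in rigorous comparisons.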
Chapter 11 completes the discussion of Chapter 10 by raising the question of how to practice machine learning in a responsible manner. It describes the dangers of data bias, and surveys data bias detection and mitigation methods; it lists the benefits of explainability and discusses techniques, such as LIME and SHAP, that have been proposed to explain the decisions made by opaque models; it underlines the risks of discrimination and discusses how to enhance fairness and prevent discrimination in machine learning algorithms. The issues of privacy and security are then presented, and the need to practice human-centered machine learning emphasized. The chapter concludes with the important issues of repeatability, reproducibility, and replicability in machine learning.
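In the spirit of the fairness discussion, one widely used quantitative check is the disparate-impact ratio, which compares positive-outcome rates across groups. The sketch below is my own minimal illustration for the two-group case, not a method prescribed by the book:

```python
# Disparate-impact ratio: the protected group's positive-outcome
# rate divided by the remaining group's rate. Values far below 1.0
# flag potential discrimination (the "80% rule" uses 0.8 as a cutoff).
def disparate_impact(outcomes, groups, protected_group):
    prot = [o for o, g in zip(outcomes, groups) if g == protected_group]
    rest = [o for o, g in zip(outcomes, groups) if g != protected_group]
    return (sum(prot) / len(prot)) / (sum(rest) / len(rest))
```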
Chapter 1 discusses the motivation for the book and the rationale for its organization into four parts: preliminary considerations, evaluation for classification, evaluation in other settings, and evaluation from a practical perspective. In more detail, the first part provides the statistical tools necessary for evaluation and reviews the main machine learning principles as well as frequently used evaluation practices. The second part discusses the most common setting in which machine learning evaluation has been applied: classification. The third part extends the discussion to other paradigms such as multi-label classification, regression analysis, data stream mining, and unsupervised learning. The fourth part broadens the conversation by moving it from the laboratory setting to the practical setting, specifically discussing issues of robustness and responsible deployment.
Chapter 8 introduces evaluation procedures for paradigms other than classification. In particular, it discusses evaluation for classical problems such as regression analysis, time-series analysis, outlier detection, and reinforcement learning, along with evaluation approaches for newer tasks such as positive-unlabeled classification, ordinal classification, multi-label classification, image segmentation, text generation, data stream mining, and lifelong learning.
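For regression, the two most common evaluation metrics can be stated in a few lines (an illustrative sketch of mine; function names are not the book's):

```python
import math

# Mean absolute error: average magnitude of prediction errors.
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Root mean squared error: like MAE, but penalizes large errors
# quadratically before taking the root.
def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2
                         for t, p in zip(y_true, y_pred)) / len(y_true))
```

The choice between them is itself an evaluation decision: RMSE is more sensitive to outliers, which may or may not be desirable for a given application.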
In Chapter 7, the history of statistical analysis is reviewed and its legacy discussed. Four situations of interest to machine learning evaluation are subsequently discussed within different statistical paradigms: the comparison of two classifiers on a single domain; the comparison of multiple classifiers on a single domain; the comparison of two classifiers on multiple domains; and the comparison of multiple classifiers on multiple domains. The three statistical paradigms considered for each of these situations are the null hypothesis statistical testing (NHST) setting; an enhanced Fisher-flavored methodology that adds the notions of confidence intervals, effect size, and power analysis to NHST; and a newer approach based on Bayesian reasoning.
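For the first of these situations, comparing two classifiers on a single domain, a standard NHST-style tool is the paired t statistic computed over per-fold scores. The sketch below is my own minimal illustration (in practice one would use a library routine such as scipy's paired t-test and consult the t distribution for a p-value):

```python
import math

# Paired t statistic over matched per-fold scores of two classifiers.
def paired_t_statistic(scores_a, scores_b):
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance of the differences (n - 1 denominator).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n)
```

As the chapter's enhanced methodology emphasizes, the statistic alone is not enough: confidence intervals on the mean difference and an effect-size measure should accompany any such test.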
As machine learning gains widespread adoption and integration across a variety of applications, including safety- and mission-critical systems, the need for robust evaluation methods grows more urgent. This book compiles scattered information on the topic from research papers and blogs to provide a centralized resource that is accessible to students, practitioners, and researchers across the sciences. The book examines meaningful metrics for diverse types of learning paradigms and applications, unbiased estimation methods, rigorous statistical analysis, fair training sets, and meaningful explainability, all of which are essential to building robust and reliable machine learning products. In addition to standard classification, the book discusses unsupervised learning, regression, image segmentation, and anomaly detection. The book also covers topics such as industry-strength evaluation, fairness, and responsible AI. Implementations using Python and scikit-learn are available on the book's website.