Chapter 10 covers the random forests algorithm for classification. Presented are also the impurity metrics applicable to splitting nodes in classification trees (Gini, entropy, and misclassification impurity), as well as permutation-based and impurity-based variable importance measures.
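As a minimal illustration of these impurity metrics (not code from the book; the class proportions below are made up), the three node-impurity measures can be computed as follows:

```python
import numpy as np

def gini(p):
    """Gini impurity for a vector of class proportions p (summing to 1)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy impurity; 0 * log2(0) is taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

def misclassification(p):
    """Misclassification impurity: 1 minus the majority-class proportion."""
    return 1.0 - np.max(np.asarray(p, dtype=float))

# Example: a node with 70% of its samples in class 1 and 30% in class 2.
proportions = [0.7, 0.3]
print(gini(proportions))              # 0.42
print(entropy(proportions))           # ~0.881
print(misclassification(proportions)) # 0.3
```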
Chapter 12 presents discriminant analysis – a classical (and powerful) supervised learning approach for classification. Discussed are Fisher’s discriminant analysis, as well as Gaussian linear, quadratic, and regularized discriminant analysis. The chapter concludes with a discussion of partial least squares discriminant analysis, which is still popular in some application areas, even though its application to high-dimensional data is likely to result in solutions that are suboptimal in terms of predictive ability and interpretability (alternative approaches are recommended).
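A brief sketch of how Gaussian linear and quadratic discriminant analysis might be fitted in practice, using scikit-learn on synthetic data; the data and settings are illustrative assumptions, not the chapter's examples:

```python
# Illustrative only: Gaussian LDA and QDA fitted to synthetic two-class data.
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)
# Two Gaussian classes in five dimensions, differing only in their means.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
               rng.normal(1.0, 1.0, size=(100, 5))])
y = np.repeat([0, 1], 100)

lda = LinearDiscriminantAnalysis().fit(X, y)     # linear decision boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # class-specific covariances
# Covariance shrinkage, one simple form of regularized discriminant analysis:
rda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)

print("LDA accuracy:", lda.score(X, y))
print("QDA accuracy:", qda.score(X, y))
print("Shrinkage-LDA accuracy:", rda.score(X, y))
```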
Chapter 17 describes the second real-life study, whose goal is the identification of multivariate biomarkers for liver cancer. This study implements parallel recursive feature elimination experiments coupled with random forests and support vector machines. Included are also considerations for rebalancing class proportions. Three multivariate biomarkers for liver cancer have been identified. The study has been performed in an R environment, and R scripts for all of its steps are provided.
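A hedged sketch of recursive feature elimination driven by random forest importances, written in Python with scikit-learn rather than the study's actual R scripts; the data, target biomarker size, and elimination step are placeholders:

```python
# Hedged sketch (not the study's R code): recursive feature elimination with a
# random forest as the importance-providing estimator, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           random_state=0)

selector = RFE(
    estimator=RandomForestClassifier(n_estimators=200, random_state=0),
    n_features_to_select=10,   # target biomarker size (illustrative)
    step=0.1,                  # drop 10% of the remaining features per round
)
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
```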
Chapter 16 presents the first of the two real-life multivariate biomarker discovery studies included in the book. The goal of this study – which implements the method presented in Chapters 14 and 15 – is to identify the essential gene expression patterns and a multivariate biomarker common to multiple types of cancer. This study is based on the TCGA RNA-Seq data of 3,528 patients and 20,530 gene expression variables; the data represent five tumor types of five different tissues. A parsimonious multivariate biomarker (consisting of ten genes) with high sensitivity and specificity has been identified.
Chapter 9 presents support vector regression (SVR), a relatively new supervised learning algorithm for predictive regression modeling which – like random forests for regression – may outperform least-squares-based methods. Discussed are the ε-insensitive loss function used by SVR, the ε-tube concept, and algorithms for linear and nonlinear SVR.
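An illustrative sketch (not taken from the book) of the ε-insensitive loss and of a nonlinear SVR fitted with scikit-learn; all data and hyperparameters are placeholders:

```python
# Illustrative only: the epsilon-insensitive loss and a nonlinear (RBF) SVR.
import numpy as np
from sklearn.svm import SVR

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Residuals falling inside the epsilon-tube incur zero loss."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
pred = model.predict(X)
print("Mean epsilon-insensitive loss:", eps_insensitive_loss(y, pred).mean())
```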
Chapter 4 provides detailed coverage of methods for the evaluation of predictive models: methods applicable to regression models implementing estimation biomarkers, as well as methods for evaluating binary and multiclass classification models. The discussion of resampling techniques accentuates the danger of information leakage and emphasizes the paramount importance of avoiding internal validation. The discussion of metrics for the evaluation of classification biomarkers includes the proper and improper interpretation of sensitivity and specificity, illustrated by the example of a screening biomarker targeting a population with a low prevalence of the tested disease. For such biomarkers, the positive predictive value may be unacceptably low even when the biomarker has very high sensitivity and specificity. Also discussed are misclassification costs and how to incorporate them into cost-sensitive classification.
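A worked example of the low-prevalence point above, with made-up numbers: even at 99% sensitivity and 99% specificity, a prevalence of 0.1% yields a positive predictive value of roughly 9%.

```python
# Illustrative calculation (made-up numbers): positive predictive value from
# sensitivity, specificity, and disease prevalence via Bayes' rule.
def positive_predictive_value(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

ppv = positive_predictive_value(sensitivity=0.99, specificity=0.99, prevalence=0.001)
print(f"PPV = {ppv:.3f}")  # ~0.090: only about 9% of positive calls are true positives
```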
Multivariate biomarker discovery is increasingly important in biomedical research and is poised to become a crucial facet of personalized medicine. This will drive demand for a myriad of novel biomarkers representing distinct 'omic' biosignatures, allowing treatments to be selected and tailored to the individual characteristics of a particular patient. This concise and self-contained book covers all aspects of predictive modeling for biomarker discovery based on high-dimensional data, as well as modern data science methods for the identification of parsimonious and robust multivariate biomarkers for medical diagnosis, prognosis, and personalized medicine. It provides a detailed description of state-of-the-art methods for parallel multivariate feature selection and supervised learning algorithms for regression and classification, as well as methods for the proper validation of multivariate biomarkers and of the predictive models implementing them. This is an invaluable resource for scientists and students interested in bioinformatics, data science, and related areas.
With diploid organisms, one is interested not only in discovering variants but also in discovering to which of the two haplotypes each variant belongs. One would thus like to identify the variants that are co-located on the same haplotype, a process called haplotype phasing. Assume we have managed to do haplotype phasing for several individuals. It is then of interest to do haplotype matching, that is, to locate long contiguous blocks shared by multiple individuals. The chapter covers algorithms and complexity analysis of these key haplotype analysis tasks. A close connection between classical indexes and a tailored data structure called the positional BWT index is established.
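A minimal sketch, in the spirit of the positional BWT, of the positional prefix arrays that make haplotype matching efficient; it assumes binary (0/1) haplotypes and is not the chapter's full construction:

```python
# Minimal sketch (assumes binary 0/1 haplotypes): positional prefix arrays.
# At each site, haplotypes are kept sorted by their reversed prefixes, so
# haplotypes sharing long blocks ending at that site become adjacent, which
# is what makes haplotype matching efficient.
def positional_prefix_arrays(haplotypes):
    m = len(haplotypes)        # number of haplotypes
    n = len(haplotypes[0])     # number of sites
    order = list(range(m))
    arrays = [order[:]]
    for k in range(n):
        zeros, ones = [], []
        for h in order:        # stable partition by the allele at site k
            (zeros if haplotypes[h][k] == 0 else ones).append(h)
        order = zeros + ones
        arrays.append(order[:])
    return arrays

haps = [[0, 1, 0, 1],
        [0, 1, 0, 0],
        [1, 1, 0, 1]]
print(positional_prefix_arrays(haps)[-1])  # [1, 0, 2]: haplotypes 0 and 2 become adjacent
```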
Analysing the content of a biological sequence can often be modeled as a segmentation problem. For example, one may wish to segment a genome into coding and non-coding regions, where only the former are translated into proteins. Statistical features of what genes usually look like can be used to derive an optimization framework. This process can be formalized through hidden Markov models, and the underlying segmentation problem can be solved using dynamic programming. This chapter introduces the key methods related to such optimization.
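A compact sketch of the dynamic programming involved: Viterbi decoding of a two-state ("coding" vs. "non-coding") hidden Markov model with illustrative, made-up parameters:

```python
# Viterbi decoding sketch: the most probable state path gives a segmentation.
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    # v[i][s]: best log-probability of any state path ending in state s at position i.
    v = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for i in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: v[i - 1][p] + log_trans[p][s])
            v[i][s] = v[i - 1][best_prev] + log_trans[best_prev][s] + log_emit[s][obs[i]]
            back[i][s] = best_prev
    # Trace back the most probable state path (the segmentation).
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

states = ("coding", "noncoding")
lg = math.log
log_start = {"coding": lg(0.5), "noncoding": lg(0.5)}
log_trans = {"coding": {"coding": lg(0.9), "noncoding": lg(0.1)},
             "noncoding": {"coding": lg(0.1), "noncoding": lg(0.9)}}
log_emit = {"coding": {b: lg(p) for b, p in zip("ACGT", (0.2, 0.3, 0.3, 0.2))},
            "noncoding": {b: lg(p) for b, p in zip("ACGT", (0.3, 0.2, 0.2, 0.3))}}
print(viterbi("ACGCGCGATATAT", states, log_start, log_trans, log_emit))
```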
Classical index structures like suffix trees are powerful, but they occupy much more space than the data they are built on. Many space-efficient alternatives exist that occupy space close to that of the input data. This chapter covers such data structures based on the Burrows–Wheeler transform (BWT). Special emphasis is given to the bidirectional BWT index, which can be used to solve basic genome analysis tasks by simulating suffix tree exploration without any sacrifice in running time. Space-efficient representations of de Bruijn graphs are also covered.
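For intuition only, a naive way to obtain the Burrows–Wheeler transform by sorting all rotations of the text; practical BWT indexes are instead built with linear-time suffix sorting and compact data structures:

```python
# Naive illustration only: the BWT as the last column of the sorted rotations
# of T$. Quadratic in time and space; real indexes avoid explicit rotations.
def bwt(text, terminator="$"):
    t = text + terminator
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("ACAACG"))  # -> "GC$AAAC"
```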
Graphs are a fundamental model for representing various relations among data. The aim of this chapter is to present some basic problems and techniques relating to graphs, mainly for finding particular paths in directed and undirected graphs. In the following chapters, we deal with various problems in biological sequence analysis that can be reduced to one of these basic ones.
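As a small, generic illustration of path finding (not a specific algorithm from the chapter), breadth-first search locates a shortest path in an unweighted graph given as an adjacency list:

```python
# Breadth-first search for a shortest path in an unweighted (here directed) graph.
from collections import deque

def shortest_path(adj, source, target):
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:   # reconstruct the path via parent pointers
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, []):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None  # target not reachable from source

adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(shortest_path(adj, "a", "d"))  # ['a', 'b', 'd']
```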
In this chapter we assume that we have a collection of reads from all the different (copies of the) transcripts of a gene. We start by showing how to extend read alignment techniques to short RNA reads, and later we show how to exploit the output of genome analysis techniques to obtain an aligner for long reads of RNA transcripts. Our final goal is to assemble the reads into the different RNA transcripts and to estimate the expression level of each transcript. The main difficulty of this problem, which we call multi-assembly, arises from the fact that the transcripts share identical substrings. We illustrate different scenarios and corresponding multi-assembly formulations, which we then solve using network flow techniques.
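A hedged sketch of one ingredient of such flow-based formulations: greedily decomposing a flow on a DAG into source-to-sink paths, each path standing in for a candidate transcript with its flow value as an abundance estimate (the graph and flow values below are placeholders):

```python
# Greedy flow decomposition sketch. Assumes a DAG with flow conservation at
# internal nodes; each extracted path is one candidate transcript, and its
# bottleneck flow serves as a rough abundance estimate.
def decompose_flow(flow, source, sink):
    # flow: dict mapping (u, v) -> non-negative flow value on edge u -> v.
    paths = []
    while True:
        path, u = [source], source
        while u != sink:
            nxt = next((v for (x, v) in flow if x == u and flow[(x, v)] > 0), None)
            if nxt is None:
                return paths  # no remaining positive-flow path from the source
            path.append(nxt)
            u = nxt
        bottleneck = min(flow[(path[i], path[i + 1])] for i in range(len(path) - 1))
        for i in range(len(path) - 1):
            flow[(path[i], path[i + 1])] -= bottleneck
        paths.append((path, bottleneck))

flow = {("s", "a"): 5, ("a", "b"): 3, ("a", "c"): 2,
        ("b", "t"): 3, ("c", "t"): 2}
print(decompose_flow(flow, "s", "t"))
# [(['s', 'a', 'b', 't'], 3), (['s', 'a', 'c', 't'], 2)]
```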
A full-text index for a string T is a data structure that is built once and kept in memory for answering an arbitrarily large number of queries on the positions and frequencies of substrings of T. Such queries can be used for speeding up dynamic programming algorithms tailored for mapping reads to a reference genome – a fundamental task in the analysis of high-throughput sequencing data. This chapter covers the classical full-text indexes, including k-mer indexes, suffix arrays, and suffix trees. Linear-time algorithms for suffix sorting and for basic genome analysis tasks, such as finding maximal exact matches, are also presented.
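A minimal sketch, not the book's linear-time constructions: a suffix array built by naive sorting, then queried by binary search to locate all occurrences of a pattern:

```python
# Naive suffix array sketch. Construction and querying here are far from the
# linear-time, space-efficient methods of the book; this only shows the idea
# that sorted suffixes allow pattern occurrences to be found by binary search.
from bisect import bisect_left, bisect_right

def suffix_array(t):
    return sorted(range(len(t)), key=lambda i: t[i:])

def occurrences(t, sa, p):
    suffixes = [t[i:] for i in sa]              # explicit list: for illustration only
    lo = bisect_left(suffixes, p)
    hi = bisect_right(suffixes, p + "\xff")     # assumes '\xff' exceeds any text symbol
    return sorted(sa[lo:hi])

t = "ACAACG"
sa = suffix_array(t)
print(occurrences(t, sa, "AC"))  # positions of "AC" in T: [0, 3]
```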