This chapter gives a minimalistic, combinatorial introduction to molecular biology, omitting the description of most biochemical processes and focusing on inputs and outputs, abstracted as mathematical objects.
This chapter connects the alignment techniques and space-efficient data structures covered in earlier chapters. It shows how to use BWT indexes for aligning sequencing reads to a reference genome. This powerful read-mapping procedure enables variant calling and genotyping of new individuals from a species whose reference genome has already been assembled.
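As a small illustration of the primitive behind BWT-based read mapping, the sketch below counts exact occurrences of a pattern with backward search over a naively built Burrows–Wheeler transform. The function names are illustrative only, and real read mappers replace the quadratic construction and linear-scan rank used here with suffix-array construction and compressed rank structures.

```python
# Minimal backward search over the Burrows-Wheeler transform (illustrative
# sketch: quadratic BWT construction and linear-time rank; real read mappers
# use suffix-array construction and compressed rank structures instead).

def bwt(text):
    text += "$"                                   # unique terminator, lexicographically smallest
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def backward_search(bwt_str, pattern):
    """Count the exact occurrences of pattern in the indexed text."""
    # C[c] = number of symbols in the text that are strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)
    def rank(c, i):                               # occurrences of c in bwt_str[:i]
        return bwt_str[:i].count(c)
    lo, hi = 0, len(bwt_str)                      # current suffix-array interval
    for c in reversed(pattern):                   # extend the match one symbol at a time
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(backward_search(bwt("ACGTACGT"), "ACG"))    # -> 2
```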
An alignment of two sequences aims to highlight how much the two sequences have in common. In computational biology, an alignment is a prediction of the evolutionary steps between the two sequences. Different costs can be assigned to such steps, and one can then seek an optimal alignment. This chapter gives a comprehensive introduction to the dynamic programming algorithms developed for various alignment formulations.
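The simplest instance of such a dynamic program is unit-cost edit distance, where each substitution, insertion, or deletion costs 1; the sketch below uses that cost scheme purely for illustration, while the chapter treats more general scoring and gap models.

```python
# Unit-cost edit distance via dynamic programming (an illustrative instance
# of the alignment recurrences; the chapter treats general cost schemes,
# gap models and local/semi-local variants).

def edit_distance(a, b):
    m, n = len(a), len(b)
    # dp[i][j] = minimum number of substitutions, insertions and deletions
    # needed to transform a[:i] into b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match / mismatch
                dp[i - 1][j] + 1,                           # delete a[i-1]
                dp[i][j - 1] + 1,                           # insert b[j-1]
            )
    return dp[m][n]

print(edit_distance("ACGTTA", "AGTTTA"))   # -> 2
```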
This chapter shows how to analyze and compare genomes without assuming that a reference genome is available. The bidirectional BWT index turns out to be essential here, and the chapter covers a comprehensive set of techniques for manipulating this data structure. The algorithms covered include computing maximal exact/unique matches, substring kernels, matching statistics, and Jaccard similarity.
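As a toy illustration of the last of these measures, the sketch below computes the Jaccard similarity of two sequences directly from explicit k-mer sets; the chapter's point is that the same quantity can be obtained space-efficiently with the bidirectional BWT index instead.

```python
# Jaccard similarity over explicit k-mer sets (illustrative; the chapter
# obtains the same quantity space-efficiently with the bidirectional
# BWT index instead of hash sets).

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(seq_a, seq_b, k):
    A, B = kmers(seq_a, k), kmers(seq_b, k)
    if not A and not B:
        return 1.0
    return len(A & B) / len(A | B)

print(jaccard("ACGTACGT", "ACGTTCGT", k=3))   # 2 shared 3-mers out of 7 distinct ones
```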
Several large-scale studies aim to build comprehensive catalogs of all the variants in a population, for example all the frequent variants in a species or all the variants in a group of individuals with a specific trait or disease. Such catalogs are the substrate for subsequent genome-wide association studies that aim to correlate variants with traits and, ultimately, to enable personalized treatments. The catalogs can also be leveraged to carry out basic analysis tasks, such as read alignment, against not just one reference genome but a pangenome data structure representing all the genomes in the catalog. The chapter gives an overview of different pangenome data structures and their applications. Selected data structures, including the r-index, are covered in more depth.
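A minimal sketch of the premise behind the r-index: the Burrows–Wheeler transform of a repetitive collection tends to have far fewer equal-letter runs than characters, so structures whose size depends on the run count r stay small. The toy collection below is an assumption made only for illustration.

```python
# The r-index premise in miniature: the Burrows-Wheeler transform of a
# repetitive collection has few equal-letter runs, so structures whose
# size is proportional to the number of runs r remain small.

from itertools import groupby

def bwt(text):
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def bwt_runs(text):
    transformed = bwt(text)
    runs = sum(1 for _ in groupby(transformed))   # count maximal equal-letter runs
    return runs, len(transformed)

# a toy "collection": three copies of (almost) the same genome, concatenated
collection = "ACGTACGTAA" + "ACGTACGTAA" + "ACGTACGTCA"
print(bwt_runs(collection))   # (number of BWT runs, length of the BWT)
```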
Assume that a drop of seawater contains cells from many distinct species. Sequencing such a mixed sample and figuring out the relative abundance of each species is a key problem in metagenomics. This chapter explores techniques for metagenomic analysis in different settings, for example with and without assuming that reference sequences are available. To solve these problems, we use techniques including tailored k-mer-based analyses, bidirectional BWT indexing, and network flows.
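As a deliberately naive illustration of the reference-based setting, the sketch below assigns each read to the reference sharing the most k-mers with it and reports relative frequencies; the species names, sequences, and reads are made up, and the chapter's methods handle sequencing errors, ambiguous reads, and the reference-free setting far more carefully.

```python
# Toy reference-based abundance estimation: assign each read to the
# reference species sharing the most k-mers with it, then report relative
# frequencies. (Illustrative only; real methods must handle sequencing
# errors, ambiguous reads and the reference-free setting.)

from collections import Counter

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def relative_abundance(references, reads, k=4):
    ref_kmers = {name: kmers(seq, k) for name, seq in references.items()}
    counts = Counter()
    for read in reads:
        rk = kmers(read, k)
        # pick the reference with the largest k-mer overlap
        best = max(ref_kmers, key=lambda name: len(rk & ref_kmers[name]))
        counts[best] += 1
    total = sum(counts.values())
    return {name: counts[name] / total for name in references}

refs = {"speciesA": "ACGTACGTACGT", "speciesB": "TTGCAATTGCAA"}
reads = ["ACGTACG", "TTGCAAT", "GTACGTA", "GCAATTG", "ACGTACGT"]
print(relative_abundance(refs, reads))   # -> {'speciesA': 0.6, 'speciesB': 0.4}
```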
This chapter presents the minimal setup of data structures required to follow the rest of the book in a self-contained manner. Balanced binary trees are enhanced to solve dynamic range minimum queries. Bitvector rank and select data structures, and their extension to larger alphabets via the wavelet tree, are covered. Then a special structure for solving static range minimum queries is derived. The chapter ends with a concise description of hashing primitives, such as perfect hashing, Bloom filters, minimizers, and the Rabin–Karp rolling hash.
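As an example of the last of these primitives, the sketch below computes Rabin–Karp rolling hashes of all k-mers of a DNA string in linear time; the base and modulus are arbitrary choices made for illustration, not values prescribed by the book.

```python
# Rabin-Karp rolling hash over a DNA string (illustrative sketch; the base
# and modulus below are arbitrary choices).

from collections import defaultdict

BASE = 4
MOD = (1 << 61) - 1                      # a large Mersenne prime modulus
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def rolling_hashes(seq, k):
    """Yield (position, hash) for every k-mer of seq in O(len(seq)) total time."""
    h = 0
    top = pow(BASE, k - 1, MOD)                     # weight of the symbol leaving the window
    for i, c in enumerate(seq):
        if i >= k:
            h = (h - CODE[seq[i - k]] * top) % MOD  # drop the outgoing symbol
        h = (h * BASE + CODE[c]) % MOD              # append the incoming symbol
        if i >= k - 1:
            yield i - k + 1, h

# usage: group positions by hash to spot (probable) repeated 3-mers
positions = defaultdict(list)
for pos, h in rolling_hashes("ACGTACGTA", 3):
    positions[h].append(pos)
print([p for p in positions.values() if len(p) > 1])   # -> [[0, 4], [1, 5], [2, 6]]
```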
Throughout the book we mostly assume the genome sequence under study to be known. In this chapter we look at strategies for assembling fragments of DNA into longer contiguous blocks, and eventually into chromosomes. The chapter is partitioned into sections roughly following the workflow of a de novo assembly project, namely error correction, contig assembly, scaffolding, and gap filling. Algorithms working with de Bruijn graphs and overlap graphs are studied.
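A minimal sketch of the central data structure in contig assembly: a node-centric de Bruijn graph whose nodes are (k-1)-mers and whose edges are the k-mers observed in the reads. The reads below are toy inputs chosen only for illustration.

```python
# Node-centric de Bruijn graph built from a set of reads (illustrative
# sketch: nodes are (k-1)-mers, edges are the k-mers observed in the reads).

from collections import defaultdict

def de_bruijn_graph(reads, k):
    graph = defaultdict(set)                 # (k-1)-mer -> set of successor (k-1)-mers
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # edge from the k-mer's prefix to its suffix
    return graph

# toy reads; contig assembly then looks for unbranching paths in this graph
reads = ["ACGTAC", "CGTACG"]
for node, succs in sorted(de_bruijn_graph(reads, 3).items()):
    print(node, "->", sorted(succs))
```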
In this chapter we show that many optimization problems can be reduced to a network flow problem. This polynomially solvable problem is a powerful model, which has found a remarkable array of applications. Roughly stated, in a network flow problem, one is given a transportation network and is required to find the optimal way of sending some content through this network. The chapter covers basic primitives around a general flow formulation called the minimum-cost flow.
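The toy instance below illustrates what a minimum-cost flow computation looks like, using the third-party networkx library purely for convenience (an assumption of this sketch; the chapter develops the underlying algorithms themselves): four units must be shipped from s to t at minimum total cost while respecting edge capacities.

```python
# A tiny minimum-cost flow instance solved with the third-party networkx
# library (used here only for illustration; the chapter develops the
# underlying algorithms themselves).

import networkx as nx

G = nx.DiGraph()
G.add_node("s", demand=-4)   # negative demand = supply: s must ship 4 units
G.add_node("t", demand=4)    # t must receive 4 units
G.add_edge("s", "a", capacity=3, weight=1)
G.add_edge("s", "b", capacity=3, weight=2)
G.add_edge("a", "t", capacity=3, weight=1)
G.add_edge("b", "t", capacity=3, weight=1)

flow = nx.min_cost_flow(G)       # flow[u][v] = units sent along edge (u, v)
print(flow)                      # routes 3 units via the cheaper path s-a-t, 1 via s-b-t
print(nx.cost_of_flow(G, flow))  # total cost 3*(1+1) + 1*(2+1) = 9
```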
This chapter gives an introduction to complexity analysis, data representations, and reductions. In addition, the Knuth–Morris–Pratt algorithm is covered to give a taste of dynamic programming – a technique introduced in Chapters 4 and 6 and used extensively thereafter.
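The sketch below shows the Knuth–Morris–Pratt algorithm, whose failure-function computation reuses previously computed values in the spirit of dynamic programming.

```python
# Knuth-Morris-Pratt exact pattern matching (illustrative sketch).

def kmp_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it (its longest border)."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]              # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Report all starting positions of pattern in text in O(|text| + |pattern|) time."""
    fail, k, hits = kmp_failure(pattern), 0, []
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):            # full match ending at position i
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits

print(kmp_search("ACGTACGTACG", "ACGTACG"))   # -> [0, 4]
```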
A pragmatic problem arising in the analysis of biological sequences is that collections of genomes, and especially collections of read sets consisting of material from many species, occupy too much space. This chapter explores techniques to efficiently compress such collections. Several algorithms related to Lempel–Ziv factorization are covered, as well as the prefix-free parsing technique to run-length encode the Burrows–Wheeler transform of a collection of genomes.
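As an illustration of the first of these ideas, the sketch below computes a greedy Lempel–Ziv-style factorization naively in quadratic time; the chapter covers far more efficient, index-based constructions.

```python
# Naive greedy Lempel-Ziv-style factorization (illustrative; quadratic time,
# whereas the chapter covers much more efficient, index-based constructions).

def lz_factorize(text):
    """Split text into phrases, each being the longest prefix of the remaining
    suffix that already occurs starting at an earlier position (or a single
    new character if there is no such occurrence)."""
    phrases, i = [], 0
    while i < len(text):
        length = 0
        # extend while text[i:i+length+1] has an occurrence starting before i
        while i + length < len(text) and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        if length == 0:
            phrases.append(text[i])          # brand-new character
            i += 1
        else:
            phrases.append(text[i:i + length])
            i += length
    return phrases

print(lz_factorize("ACACACGTACACACGT"))   # -> ['A', 'C', 'ACAC', 'G', 'T', 'ACACACGT']
```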