In this chapter we look at a number of important refinements that have been developed for certain core string edit and alignment problems. These refinements either speed up a dynamic programming solution, reduce its space requirements, or extend its utility.
Computing alignments in only linear space
One of the defects of dynamic programming for all the problems we have discussed is that the dynamic programming tables use Θ(nm) space when the input strings have length n and m. (When we talk about the space used by a method, we refer to the maximum space ever in use simultaneously. Reused space does not add to the count of space use.) It is quite common that the limiting resource in string alignment problems is not time but space. That limit makes it difficult to handle large strings, no matter how long we may be willing to wait for the computation to finish. Therefore, it is very valuable to have methods that reduce the use of space without dramatically increasing the time requirements.
Hirschberg [224] developed an elegant and practical space-reduction method that works for many dynamic programming problems. For several string alignment problems, this method reduces the required space from Θ(nm) to O(n) (for n < m) while only doubling the worst-case time bound. Miller and Myers expanded on the idea and brought it to the attention of the computational biology community [344].
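To make the space issue concrete, here is a minimal Python sketch (not Hirschberg's full method) of the easy half of the idea: the edit distance value can be computed keeping only two rows of the table, since row i of the dynamic program depends only on row i - 1. Recovering an optimal alignment itself in linear space is the harder part, and that is what Hirschberg's divide-and-conquer refinement provides, at roughly twice the work.

```python
def edit_distance_two_rows(s1, s2):
    """Edit distance in O(min(n, m)) space: keep only the current and
    previous rows of the dynamic programming table. This yields the
    distance but not the alignment; Hirschberg's divide-and-conquer
    refinement recovers the alignment in linear space as well."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1           # make s2 the shorter string (row length)
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]                # D(i, 0) = i deletions
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                 # delete c1
                            curr[j - 1] + 1,             # insert c2
                            prev[j - 1] + (c1 != c2)))   # substitute or match
        prev = curr
    return prev[-1]

assert edit_distance_two_rows("vintner", "writers") == 5
```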
We now begin the discussion of an amazing result that greatly extends the usefulness of suffix trees (in addition to many other applications).
Definition In a rooted tree T, a node u is an ancestor of a node v if u is on the unique path from the root to v. With this definition a node is an ancestor of itself. A proper ancestor of v refers to an ancestor that is not v.
Definition In a rooted tree T, the lowest common ancestor (lca) of two nodes x and y is the deepest node in T that is an ancestor of both x and y.
For example, in Figure 8.1 the lca of nodes 6 and 10 is node 5, while the lca of nodes 6 and 3 is node 1.
The amazing result is that after a linear amount of preprocessing of a rooted tree, any two nodes can then be specified and their lowest common ancestor found in constant time. That is, a rooted tree with n nodes is first preprocessed in O(n) time, and thereafter any lowest common ancestor query takes only constant time to solve, independent of n. Without preprocessing, the best worst-case time bound for a single query is Θ(n), so this is a most surprising and useful result. The lca result was first obtained by Harel and Tarjan [214] and later simplified by Schieber and Vishkin [393]. The exposition here is based on the later approach.
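For contrast with the constant-time result, the Θ(n) per-query baseline is easy to state in code. The sketch below assumes a hypothetical node representation storing a parent pointer and a depth; it simply lifts the deeper node to the other's depth and then walks both nodes up in lockstep.

```python
class Node:
    """Hypothetical rooted-tree node with a parent pointer and depth."""
    def __init__(self, parent=None):
        self.parent = parent
        self.depth = 0 if parent is None else parent.depth + 1

def naive_lca(x, y):
    """Theta(n) worst-case lca query: bring both nodes to equal depth,
    then move them up together until they meet. This is the per-query
    baseline that O(n)-time preprocessing reduces to constant time."""
    while x.depth > y.depth:
        x = x.parent
    while y.depth > x.depth:
        y = y.parent
    while x is not y:
        x, y = x.parent, y.parent
    return x    # a node is an ancestor of itself, so lca(x, x) = x
```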
This chapter develops a number of classical comparison-based matching algorithms for the exact matching problem. With suitable extensions, all of these algorithms can be implemented to run in linear worst-case time, and all achieve this performance by preprocessing pattern P. (Methods that preprocess T will be considered in Part II of the book.) The original preprocessing methods for these various algorithms are related in spirit but are quite different in conceptual difficulty. Some of the original preprocessing methods are quite difficult. This chapter does not follow the original preprocessing methods but instead exploits fundamental preprocessing, developed in the previous chapter, to implement the needed preprocessing for each specific matching algorithm.
Also, in contrast to previous expositions, we emphasize the Boyer–Moore method over the Knuth-Morris-Pratt method, since Boyer–Moore is the practical method of choice for exact matching. Knuth-Morris-Pratt is nonetheless completely developed, partly for historical reasons, but mostly because it generalizes to problems such as real-time string matching and matching against a set of patterns more easily than Boyer–Moore does. These two topics will be described in this chapter and the next.
The Boyer–Moore Algorithm
As in the naive algorithm, the Boyer–Moore algorithm successively aligns P with T and then checks whether P matches the opposing characters of T. Further, after the check is complete, P is shifted right relative to T just as in the naive algorithm. However, the Boyer–Moore algorithm contains three clever ideas not contained in the naive algorithm: the right-to-left scan, the bad character shift rule, and the good suffix shift rule.
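The following Python sketch shows the right-to-left scan combined with only a simplified (weak) bad character rule. It is not the full algorithm, which also applies the good suffix rule (and an extended bad character rule) to obtain larger shifts, but it illustrates how the first two ideas fit together.

```python
def boyer_moore_bad_char_only(P, T):
    """Right-to-left scanning with the simple bad character rule only.
    On a mismatch against text character c, shift P so that the rightmost
    occurrence of c in P lines up under the mismatch (or shift by 1 if
    that would move P leftward). The good suffix rule is omitted here."""
    n, m = len(P), len(T)
    rightmost = {c: i for i, c in enumerate(P)}   # rightmost index of c in P
    matches = []
    k = 0                                         # P[0] aligned with T[k]
    while k <= m - n:
        i = n - 1
        while i >= 0 and P[i] == T[k + i]:        # right-to-left comparison
            i -= 1
        if i < 0:
            matches.append(k)                     # full occurrence found
            k += 1     # (the full algorithm shifts by the good suffix rule)
        else:
            k += max(1, i - rightmost.get(T[k + i], -1))
    return matches

assert boyer_moore_bad_char_only("abc", "xabcabc") == [1, 4]
```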
In this chapter we consider the inexact matching and alignment problems that form the core of the field, along with others that illustrate the most general techniques. Some of those problems and techniques will be further refined and extended in the next chapters. We start with a detailed examination of the most classic inexact matching problem solved by dynamic programming, the edit distance problem. The motivation for inexact matching (and, more generally, sequence comparison) in molecular biology will be a recurring theme explored throughout the rest of the book. We will discuss many specific examples of how string comparison and inexact matching are used in current molecular biology. However, to begin, we concentrate on the purely formal and technical aspects of defining and computing inexact matching.
The edit distance between two strings
Frequently, one wants a measure of the difference or distance between two strings (for example, in evolutionary, structural, or functional studies of biological strings; in textual database retrieval; or in spelling correction methods). There are several ways to formalize the notion of distance between strings. One common, and simple, formalization [389, 299], called edit distance, focuses on transforming (or editing) one string into the other by a series of edit operations on individual characters. The permitted edit operations are insertion of a character into the first string, the deletion of a character from the first string, or the substitution (or replacement) of a character in the first string with a character in the second string.
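A compact Python rendering of the standard dynamic program makes the definition concrete: D(i, j) is the edit distance between the first i characters of the first string and the first j characters of the second, and a traceback over the full table yields one optimal edit transcript (I for insert, D for delete, R for replace, and M for a costless match).

```python
def edit_distance_with_transcript(S1, S2):
    """Full-table O(nm) dynamic program for edit distance, with a
    traceback that recovers one optimal edit transcript."""
    n, m = len(S1), len(S2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                     # delete the first i characters of S1
    for j in range(m + 1):
        D[0][j] = j                     # insert the first j characters of S2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,                            # deletion
                          D[i][j - 1] + 1,                            # insertion
                          D[i - 1][j - 1] + (S1[i - 1] != S2[j - 1])) # sub/match
    ops, i, j = [], n, m                # trace back from (n, m) to (0, 0)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (S1[i - 1] != S2[j - 1]):
            ops.append('M' if S1[i - 1] == S2[j - 1] else 'R')
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append('D')
            i -= 1
        else:
            ops.append('I')
            j -= 1
    return D[n][m], ''.join(reversed(ops))

# e.g. edit_distance_with_transcript("vintner", "writers") returns distance 5
# together with one optimal transcript over {M, R, I, D}.
```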
Sequence comparison, particularly when combined with the systematic collection, curation, and search of databases containing biomolecular sequences, has become essential in modern molecular biology. Commenting on the (then) near-completion of the effort to sequence the entire yeast genome (now finished), Stephen Oliver says
In a short time it will be hard to realize how we managed without the sequence data. Biology will never be the same again. [478]
One fact explains the importance of molecular sequence data and sequence comparison in biology.
The first fact of biological sequence analysis
In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity.
Evolution reuses, builds on, duplicates, and modifies “successful” structures (proteins, exons, DNA regulatory sequences, morphological features, enzymatic pathways, etc.). Life is based on a repertoire of structured and interrelated molecular building blocks that are shared and passed around. The same and related molecular structures and mechanisms show up repeatedly in the genome of a single species and across a very wide spectrum of divergent species. “Duplication with modification” [127, 128, 129, 130] is the central paradigm of protein evolution, wherein new proteins and/or new biological functions are fashioned from earlier ones. Doolittle emphasizes this point as follows:
The vast majority of extant proteins are the result of a continuous series of genetic duplications and subsequent modifications.
We will see many applications of suffix trees throughout the book. Most of these applications allow surprisingly efficient, linear-time solutions to complex string problems. Some of the most impressive applications need an additional tool, the constant-time lowest common ancestor algorithm, and so are deferred until that algorithm has been discussed (in Chapter 8). Other applications arise in the context of specific problems that will be discussed in detail later. But there are many applications we can now discuss that illustrate the power and utility of suffix trees. In this chapter and in the exercises at its end, several of these applications will be explored.
Perhaps the best way to appreciate the power of suffix trees is for the reader to spend some time trying to solve the problems discussed below, without using suffix trees. Without this effort or without some historical perspective, the availability of suffix trees may make certain of the problems appear trivial, even though linear-time algorithms for those problems were unknown before the advent of suffix trees. The longest common substring problem discussed in Section 7.4 is one clear example, where Knuth had conjectured that a linear-time algorithm would not be possible [24, 278], but where such an algorithm is immediate with the use of suffix trees. Another classic example is the longest prefix repeat problem discussed in the exercises, where a linear-time solution using suffix trees is easy, but where the best prior method ran in O(n log n) time.
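To appreciate the contrast, here is the straightforward quadratic-time dynamic program for longest common substring (a hypothetical sketch for comparison, not the suffix-tree method): L(i, j) is the length of the longest common suffix of the first i characters of S1 and the first j characters of S2, and the answer is the maximum entry.

```python
def longest_common_substring(S1, S2):
    """O(|S1| * |S2|) dynamic program, shown only for contrast: a
    generalized suffix tree solves the same problem in linear time."""
    best, end = 0, 0                       # best length and its end in S1
    prev = [0] * (len(S2) + 1)
    for i, a in enumerate(S1, 1):
        curr = [0] * (len(S2) + 1)
        for j, b in enumerate(S2, 1):
            if a == b:
                curr[j] = prev[j - 1] + 1  # extend the common suffix
                if curr[j] > best:
                    best, end = curr[j], i
        prev = curr
    return S1[end - best:end]

assert longest_common_substring("superiorcalifornialives", "sealiver") == "alive"
```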
With the ability to solve lowest common ancestor queries in constant time, suffix trees can be used to solve many additional string problems. Many of those applications move from the domain of exact matching to the domain of inexact, or approximate, matching (matching with some errors permitted). This chapter illustrates that point with several examples.
Longest common extension: a bridge to inexact matching
The longest common extension problem is solved as a subtask in many classic string algorithms. It is at the heart of all but the last application discussed in this chapter and is central to the k-difference algorithm discussed in Section 12.2.
Longest common extension problem Two strings S1 and S2 of total length n are first specified in a preprocessing phase. Later, a long sequence of index pairs is specified. For each specified index pair (i, j), we must find the length of the longest substring of S1 starting at position i that matches a substring of S2 starting at position j. That is, we must find the length of the longest prefix of suffix i of S1 that matches a prefix of suffix j of S2 (see Figure 9.1).
Of course, any time an index pair is specified, the longest common extension can be found by direct search in time proportional to the length of the match. But the goal is to compute each extension in constant time, independent of the length of the match.
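The direct search just mentioned is a two-line scan over 0-based positions; the point of the machinery in this chapter is to replace its cost, proportional to the match length, with a constant per query after linear preprocessing. A minimal sketch of the naive fallback:

```python
def lce_by_direct_search(S1, i, S2, j):
    """Naive longest common extension: compare character by character.
    Time is proportional to the length of the match; the suffix-tree-
    plus-lca method answers the same query in constant time."""
    k = 0
    while i + k < len(S1) and j + k < len(S2) and S1[i + k] == S2[j + k]:
        k += 1
    return k
```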
In Section 15.11.3, we discussed the canonical advice of translating any newly sequenced gene into a derived amino acid sequence to search the protein databases for similarities to the new sequence. This is in contrast to searching DNA databases with the original DNA string. There is, however, a technical problem with using derived amino acid sequences. If a single nucleotide is missing from the DNA transcript, then the reading frame of the succeeding DNA will be changed (see Figure 18.1). A similar problem occurs if a nucleotide is incorrectly inserted into the transcript. Until the correct reading frame is reestablished (through additional errors), most of the translated amino acids will be incorrect, invalidating most comparisons made to the derived amino acid sequence.
Insertion and deletion errors during DNA sequencing are fairly common, so frameshift errors can be serious in the subsequent analysis. Those errors are in addition to any substitution errors that leave the reading frame unchanged. Moreover, informative alignments often contain a relatively small number of exactly matching characters and larger regions of more poorly aligned substrings (see Section 11.7 on local alignment). Therefore, two substrings that would align well without a frameshift error but would align poorly with one can easily be mistaken for regions that align poorly due only to substitution errors. Hence, without some additional technique, it is easy to miss frameshift errors and hard to correct them.
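A tiny illustration (with a made-up sequence) shows why a single deletion is so damaging: every codon downstream of the error changes, so most of the derived amino acids are wrong until another error happens to restore the frame.

```python
def codons(dna):
    """Split a DNA string into successive 3-character codons (frame 0),
    dropping any incomplete trailing codon."""
    return [dna[k:k + 3] for k in range(0, len(dna) - len(dna) % 3, 3)]

original = "ATGGCATTTGGC"
print(codons(original))    # ['ATG', 'GCA', 'TTT', 'GGC']

# Delete the single nucleotide at position 4; every downstream codon
# changes, so the amino acids translated after the error are incorrect.
shifted = original[:4] + original[5:]
print(codons(shifted))     # ['ATG', 'GAT', 'TTG']
```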
We now turn to the dominant, most mature, and most successful application of string algorithms in computational biology: the building and searching of databases holding molecular sequence data. We start by illustrating some uses of sequence databases and by describing a bit about existing databases. The impact of database searching (and expectations of greater future impact) explains a large part of the interest among biologists in algorithms that search, manipulate, and compare strings. In turn, the biologists' activities have stimulated additional interest in string algorithms among computer scientists.
After describing the “why” and the “what” of sequence databases, we will discuss some “how” issues in string algorithms that are particular to database organization and search.
Why sequence databases?
Comprehensive databases/archives holding DNA and protein sequences are firmly established as central tools in current molecular biology – “electronic databases are fast becoming the lifeblood of the field” [452]. The fundamental reason is the power of biomolecular sequence comparison. This was made explicit in the first fact of biological sequence analysis (Chapter 10, page 212). Given the effectiveness of sequence comparison in molecular biology, it is natural to stockpile and systematically organize the biosequences to be compared; this has naturally led to the growth of sequence databases. We start this chapter with a few illustrations of the power of sequence comparison in the form of sequence database search.
The dominant view of the evolution of life is that all existing organisms are derived from some common ancestor and that a new species arises by a splitting of one population into two (or more) populations that do not cross-breed, rather than by a mixing of two populations into one. Therefore, the high-level history of life is ideally organized and displayed as a rooted, directed tree. The extant species (and some of the extinct species) are represented at the leaves of the tree, each internal node represents a point at which the histories of two sets of species diverged (or represents a common ancestor of those species), the length and direction of each edge represents the passage of time or the evolutionary events that occur in that time, and so the path from the root of the tree to each leaf represents the evolutionary history of the organisms represented there. To quote Darwin:
… the great Tree of Life fills with its dead and broken branches the crust of the earth, and covers the surface with its ever-branching and beautiful ramifications.
[119]
This view of the history of life as a tree must frequently be modified when considering the evolution of viruses, or even bacteria or individual genes, but it remains the dominant way that high-level evolution is viewed in current biology. Hundreds (maybe thousands) of papers are published yearly that depict deduced evolutionary trees.
In the previous three parts of the book we developed general techniques and specific string algorithms whose importance is either already well established or is likely to be established. We expect that the material of those three parts will be relevant to the field of string algorithms and molecular sequence analysis for many years to come. In this final part of the book we branch out from well-established techniques and from problems strictly defined on strings. We do this in three ways.
First, we discuss techniques that are very current but may not stand the test of time, although they may lead to more powerful and effective methods. Similarly, we discuss string problems that are tied to current technology in molecular biology but may become less important as that technology changes.
Second, we discuss problems, such as physical mapping, fragment assembly, and building phylogenetic (evolutionary) trees, that, although related to string problems, are not themselves string problems. These cousins of string problems either motivate specific pure string problems or motivate string problems generally by providing a more complete picture of how biological sequence data are obtained, or they use the output of pure string algorithms.
Third, we introduce a few important cameo topics without giving as much depth and detail as has generally been given to other topics in the book.
Of course, some topics to be presented in this final part of the book cross the three categories and are simultaneously currents, cousins, and cameos.
We will present two methods for constructing suffix trees in detail, Ukkonen's method and Weiner's method. Weiner was the first to show that suffix trees can be built in linear time, and his method is presented both for its historical importance and for some different technical ideas that it contains. However, Ukkonen's method is equally fast and uses far less space (i.e., memory) in practice than Weiner's method. Hence Ukkonen's is the method of choice for most problems requiring the construction of a suffix tree. We also believe that Ukkonen's method is easier to understand. Therefore, it will be presented first. A reader who wishes to study only one method is advised to concentrate on Ukkonen's. However, our development of Weiner's method does not depend on understanding Ukkonen's algorithm, and the two algorithms can be read independently (with one small shared section noted in the description of Weiner's method).
Ukkonen's linear-time suffix tree algorithm
Esko Ukkonen [438] devised a linear-time algorithm for constructing a suffix tree that may be the conceptually easiest linear-time construction algorithm. This algorithm has a space-saving improvement over Weiner's algorithm (which was achieved first in the development of McCreight's algorithm), and it has a certain “on-line” property that may be useful in some situations. We will describe that on-line property but emphasize that the main virtue of Ukkonen's algorithm is the simplicity of its description, proof, and time analysis.
A look at some DNA mapping and sequencing problems
In this chapter we consider a number of theoretical and practical issues in creating and using genome maps and in large-scale (genomic) DNA sequencing. These areas are considered in this book for two reasons: First, we want to more completely explain the origin of molecular sequence data, since string problems on such data provide a large part of the motivation for studying string algorithms in general. Second, we need to more completely explain specific problems on strings that arise in obtaining molecular sequence data.
We start with a discussion of mapping in general and the distinction between physical maps and genetic maps. This leads to the discussion of several physical mapping techniques such as STS-content mapping and radiation-hybrid mapping. Our discussion emphasizes the combinatorial and computational aspects common to those techniques. We follow with a discussion of the tightest layout problem, and a short introduction to map comparison and map alignment. Then we move to large-scale sequencing and its relation to physical mapping. We emphasize shotgun sequencing and the string problems involved in sequence assembly under the shotgun strategy. Shotgun sequencing leads naturally to a beautiful pure string problem, the shortest common superstring problem. This pure, exact string problem is motivated by the practical problem of shotgun sequence assembly and deserves attention if only for the elegance of the results that have been obtained.
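Since the shortest common superstring problem is NP-hard in its exact form, the classic attack is the greedy merging heuristic: repeatedly merge the two strings with the largest suffix-prefix overlap. The sketch below is a hypothetical minimal rendering of that heuristic, not an exact algorithm and not this chapter's full treatment.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(strings):
    """Greedy heuristic for shortest common superstring: repeatedly merge
    the pair with maximum overlap until one string remains."""
    # Strings contained in another string never affect the answer.
    S = [s for s in strings if not any(s != t and s in t for t in strings)]
    while len(S) > 1:
        k, a, b = max((overlap(a, b), a, b)
                      for a in S for b in S if a is not b)
        S.remove(a); S.remove(b)
        S.append(a + b[k:])        # merge, writing the overlap only once
    return S[0]

# e.g. greedy_superstring(["cattt", "ttcat", "ttt"]) returns "ttcattt".
```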
In this chapter we begin the discussion of multiple string comparison, one of the most important methodological issues and most active research areas in current biological sequence analysis. We first discuss some of the reasons for the importance of multiple string comparison in molecular biology. Then we will examine multiple string alignment, one common way that multiple string comparison has been formalized. We will precisely define three variants of the multiple alignment problem and consider in depth algorithms for attacking those problems. Other variants will be sketched in this chapter; additional multiple alignment issues will be discussed in Part IV.
Why multiple string comparison?
For a computer scientist, the multiple string comparison problem may at first seem like a generalization for generalization's sake – “two strings good, four strings better”. But in the context of molecular biology, multiple string comparison (of DNA, RNA, or protein strings) is much more than a technical exercise. It is the most critical cutting-edge tool for extracting and representing biologically important, yet faint or widely dispersed, commonalities from a set of strings. These (faint) commonalities may reveal evolutionary history, critical conserved motifs or conserved characters in DNA or protein, common two- and three-dimensional molecular structure, or clues about the common biological function of the strings. Such commonalities are also used to characterize families or superfamilies of proteins. These characterizations are then used in database searches to identify other potential members of a family.
Although I didn't know it at the time, I began writing this book in the summer of 1988 when I was part of a computer science (early bioinformatics) research group at the Human Genome Center of Lawrence Berkeley Laboratory. Our group followed the standard assumption that biologically meaningful results could come from considering DNA as a one-dimensional character string, abstracting away the reality of DNA as a flexible three-dimensional molecule, interacting in a dynamic environment with protein and RNA, and repeating a life-cycle in which even the classic linear chromosome exists for only a fraction of the time. A similar, but stronger, assumption existed for protein, holding, for example, that all the information needed for correct three-dimensional folding is contained in the protein sequence itself, essentially independent of the biological environment the protein lives in. This assumption has recently been modified, but remains largely intact [297].
For nonbiologists, these two assumptions were (and remain) a godsend, allowing rapid entry into an exciting and important field. Reinforcing the importance of sequence-level investigation were statements such as:
The digital information that underlies biochemistry, cell biology, and development can be represented by a simple string of G's, A's, T's and C's. This string is the root data structure of an organism's biology. [352]
Plotkin (1975) showed that the lambda calculus is a good model of the evaluation process for call-by-name functional programs. Reducing programs to constants or lambda abstractions according to the leftmost-outermost strategy exactly mirrors execution on an abstract machine like Landin's SECD machine. The machine-based evaluator returns a constant or the token closure if and only if the standard reduction sequence starting at the same program will end in the same constant or in some lambda abstraction. However, the calculus does not capture the sharing of the evaluation of arguments that lazy implementations use to speed up the execution. More precisely, a lazy implementation evaluates procedure arguments only when needed and then only once. All other references to the formal procedure parameter re-use the value of the first argument evaluation. The mismatch between the operational semantics of the lambda calculus and the actual behavior of the prototypical implementation is a major obstacle for compiler writers. Unlike implementors of the leftmost-outermost strategy or of a call-by-value language, implementors of lazy systems cannot easily explain the behavior of their evaluator in terms of source level syntax. Hence, they often cannot explain why a certain syntactic transformation ‘works’ and why another doesn't. In this paper we develop an equational characterization of the most popular lazy implementation technique – traditionally called ‘call-by-need’ – and prove it correct with respect to the original lambda calculus. The theory is a strictly smaller theory than Plotkin's call-by-name lambda calculus. Immediate applications of the theory concern the correctness proofs of a number of implementation strategies, e.g. the call-by-need continuation passing transformation and the realization of sharing via assignments. Some of this material first appeared in a paper presented at the 1995 ACM Conference on the Principles of Programming Languages. The paper was a joint effort with Maraist, Odersky and Wadler, who had independently developed a different equational characterization of call-by-need. We contrast our work with that of Maraist et al. in the body of this paper where appropriate.
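The sharing discipline described here can be made concrete with a small, hypothetical memoized-thunk sketch (in Python, not the paper's calculus): a call-by-name argument would be re-evaluated at every use, while a call-by-need argument is evaluated at most once and its value is reused thereafter.

```python
class Thunk:
    """A memoized suspension modelling a call-by-need argument: the
    suspended computation runs at most once, and every later force
    reuses the cached value. Under call-by-name, each use would
    re-run the computation instead."""
    def __init__(self, compute):
        self._compute = compute
        self._forced = False
        self._value = None

    def force(self):
        if not self._forced:
            self._value = self._compute()   # evaluate only when needed...
            self._forced = True             # ...and remember the result
            self._compute = None            # the suspension can now be freed
        return self._value

def expensive_argument():
    print("evaluating argument")
    return 21

arg = Thunk(expensive_argument)
print(arg.force() + arg.force())   # prints "evaluating argument" once, then 42
```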
It is well known that computed torque robot control is subject to performance degradation due to uncertainties in the robot model, and the application of neural network (NN) compensation techniques is promising. In this paper we examine the effectiveness of a neural network as a compensator for the complex problem of Cartesian space control. In particular, we examine the differences in system performance for accurate position control when the same NN compensator is applied at different locations in the controller structure. It is found that using the NN to modify the reference trajectory to compensate for model uncertainties is much more effective than the traditional approach of modifying the control input or joint torque/force. To facilitate the analysis, a new NN training signal is introduced and used for all cases. The study is also extended to non-model-based Cartesian control problems. Simulation results with a three-link rotary robot are presented, and the performances of different compensating locations are compared.
Isotropic velocity radius (IVR) and isotropic acceleration radius (IAR) are proposed as local performance indices to quantify the dynamic responsiveness of a multi-arm robot with regard to velocity effects. These performance measures are defined on the basis of the acceleration set describing the effects of actuator torques, the velocity of the manipulated object, and gravity upon the acceleration of the object. An algorithm is presented to obtain an explicit expression for calculating these measures by decomposing the torque-related non-square matrix into square matrices, without using the concept of the pseudo-inverse, for a planar multi-arm robot composed of arms each with two joints.
Numerical examples for a planar robot with two 2R arms show that the proposed concepts are effective in representing the acceleration capability and velocity effects of a multi-arm robot.