Let us recall the definition: a word f in A* is a finite sequence of elements of A, called letters. We shall call a subword of a word f any subsequence of the sequence f. The word aba, for instance, is a subword of the word bacbcab as well as of the word aabbaa. It can be observed immediately that two subsequences of f, distinct as subsequences, may define the same subword: thus aba is a subword of bacbcab in only one way, but may be obtained as a subword of aabbaa in eight different ways.
Given a word f, it is easy to compute the set of its subwords together with their multiplicities; this computation rests on a simple induction formula. The main problem of interest in this chapter, sometimes implicitly but more often explicitly, is that of the inverse correspondence. Under what conditions is a given set of words S the set of subwords, or a subset of a certain kind of the set of subwords, of a word f? Once these conditions are met, what are the words f that are thus determined? In which cases are they uniquely determined? Some of these conditions on the set S are rather obvious. For instance, if g is a subword of f, then any subword of g is a subword of f.
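The induction formula just mentioned can be made concrete: scanning f letter by letter while keeping, for each prefix of a candidate word g, the number of ways that prefix appears as a subword so far yields the multiplicity of g in f. A minimal sketch (the function name is ours):

```python
def subword_count(f, g):
    """Number of distinct subsequences of f that spell the word g."""
    # dp[j] = number of ways g[:j] occurs as a subword of the prefix
    # of f scanned so far; dp[0] = 1 for the empty word.
    dp = [1] + [0] * len(g)
    for c in f:
        # traverse j in decreasing order so each letter of f is used
        # at most once per occurrence
        for j in range(len(g), 0, -1):
            if g[j - 1] == c:
                dp[j] += dp[j - 1]
    return dp[len(g)]
```

For the examples above, subword_count("bacbcab", "aba") returns 1 and subword_count("aabbaa", "aba") returns 8.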
The aim of this chapter is to give a detailed presentation of the relation between plane trees and special families of words: parenthesis systems and other families. The relation between trees and parenthesis notation is classical and has been known perhaps since Catalan 1838.
Because trees play a central role in the field of combinatorial algorithms (Knuth 1968), their coding by parenthesis notation has been investigated so often that it is quite impossible to give a complete list of all the papers dealing with the topic. These subjects are also considered in enumeration theory and are known to combinatorialists (Comtet 1970) as being counted by Catalan numbers. Note that a generalization of this type of parenthesis system, often called the Dyck language, is a central concept in formal language theory. These remarks give a good account of the central role played by trees and their coding in combinatorics on words.
Presented here are three ways to represent trees by words. The first consists in constructing a set of words (one for each node) associated with a plane tree. The second is the classical parenthesis coding, and the third concerns the Łukasiewicz language (also known as Polish notation).
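For concreteness, here is a sketch of the last two codings on plane trees represented as nested lists of children (the representation and function names are ours, not the chapter's notation):

```python
def paren_code(tree):
    """Parenthesis coding: one matched pair per node, children in order."""
    return "(" + "".join(paren_code(child) for child in tree) + ")"

def lukasiewicz_code(tree):
    """Lukasiewicz word: the sequence of node arities in prefix (Polish) order."""
    word = [len(tree)]
    for child in tree:
        word += lukasiewicz_code(child)
    return word

# a root with two children, the second of which has two leaf children
t = [[], [[], []]]
```

Here paren_code(t) gives "(()(()()))", a well-formed parenthesis system with one pair per node, and lukasiewicz_code(t) gives [2, 0, 2, 0, 0], the arities read off in prefix order.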
The combinatorial properties of the Łukasiewicz language were investigated by Raney (1960) in order to give a purely combinatorial proof of the Lagrange inversion formula (see also Schützenberger 1971). This proof is presented in Section 11.4 of the present chapter as an application of our combinatorial constructions.
This chapter is devoted to a study of van der Waerden's theorem, which is, according to Khinchin, one of the “pearls of number theory.” This theorem illustrates a principle of unavoidable regularity: It is impossible to produce long sequences of elements taken from a finite set that do not contain subsequences possessing some regularity, in this instance arithmetic progressions of identical elements.
During the last fifty years, van der Waerden's theorem has stimulated a good deal of research on various aspects of the result. Efforts have been made to simplify the proof while at the same time generalizing the theorem, as well as to determine certain numerical constants that occur in the statement of the theorem. This work is of an essentially combinatorial nature. More recently, results from ergodic theory have led to the discovery of new extensions of van der Waerden's theorem, and, as a result, to a topological proof.
The plan of the chapter illustrates this diversity of viewpoints. The first section, after a brief historical note, presents several different formulations of van der Waerden's theorem. The second section gives a combinatorial proof of an elegant generalization due to Grünwald. The third section, which concerns “cadences,” gives an interpretation of the theorem in terms of the free monoid. The fourth section presents a topological proof of van der Waerden's theorem, due to Furstenberg and Weiss.
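The simplest nontrivial instance of the theorem can be checked by brute force: with two letters, every word of length 9 contains three identical letters in arithmetic progression (the van der Waerden number W(3, 2) = 9), while some words of length 8 do not. A sketch of that check:

```python
from itertools import product

def has_ap(word, k):
    """True iff word contains k identical letters in arithmetic progression."""
    n = len(word)
    for start in range(n):
        for d in range(1, n):
            if start + (k - 1) * d >= n:
                break
            if all(word[start + i * d] == word[start] for i in range(k)):
                return True
    return False

# every binary word of length 9 contains such a progression ...
all_length_9 = all(has_ap(w, 3) for w in product("ab", repeat=9))
# ... but this word of length 8 does not
short_witness = has_ap("abbaabba", 3)
```

The exhaustive check over all 512 binary words of length 9 runs instantly; the word abbaabba shows the bound 9 is sharp.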
Combinatorics on words is a field that has grown separately within several branches of mathematics, such as group theory or probabilities, and appears frequently in problems of computer science dealing with automata and formal languages. It may now be considered as an independent theory because of both the number of results that it contains and the variety of possible applications.
This book is the first attempt to present a unified treatment of the theory of combinatorics on words. It covers the main results and methods in an elementary presentation and can be used as a textbook in mathematics or computer science at undergraduate or graduate level. It will also help researchers in these fields by putting together a lot of results scattered in the literature.
The idea of writing this book arose a few years ago among the group of people who have collectively realized it. The starting point was a mimeographed text of lectures given by M. P. Schützenberger at the University of Paris in 1966 and written down by J. F. Perrot. The title of this text was “Quelques Problèmes combinatoires de la théorie des automates.” It was widely circulated and served many people (including most of the authors of this book) as an introduction to this field.
Given a string P called the pattern and a longer string T called the text, the exact matching problem is to find all occurrences, if any, of pattern P in text T.
For example, if P = aba and T = bbabaxababay then P occurs in T starting at locations 3, 7, and 9. Note that two occurrences of P may overlap, as illustrated by the occurrences of P at locations 7 and 9.
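The definition can be checked directly by sliding P along T and comparing (a sketch; positions are reported 1-based, as in the example):

```python
def occurrences(P, T):
    """All (possibly overlapping) 1-based start positions of P in T."""
    n = len(P)
    return [i + 1 for i in range(len(T) - n + 1) if T[i:i + n] == P]
```

For P = aba and T = bbabaxababay this returns [3, 7, 9], including the overlapping pair of occurrences at positions 7 and 9.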
Importance of the exact matching problem
The practical importance of the exact matching problem should be obvious to anyone who uses a computer. The problem arises in widely varying applications, too numerous to even list completely. Some of the more common applications are in word processors; in utilities such as grep on Unix; in textual information retrieval programs such as Medline, Lexis, or Nexis; in library catalog searching programs that have replaced physical card catalogs in most large libraries; in internet browsers and crawlers, which sift through massive amounts of text available on the internet for material containing specific keywords; in internet news readers that can search the articles for topics of interest; in the giant digital libraries that are being planned for the near future; in electronic journals that are already being “published” on-line; in telephone directory assistance; in on-line encyclopedias and other educational CD-ROM applications; in on-line dictionaries and thesauri, especially those with cross-referencing features (the Oxford English Dictionary project has created an electronic on-line version of the OED containing 50 million words); and in numerous specialized databases.
All of the exact matching methods in the first three chapters, as well as most of the methods that have yet to be discussed in this book, are examples of comparison-based methods. The main primitive operation in each of those methods is the comparison of two characters. There are, however, string matching methods based on bit operations or on arithmetic, rather than character comparisons. These methods therefore have a very different flavor than the comparison-based approaches, even though one can sometimes see character comparisons hidden at the inner level of these “seminumerical” methods. We will discuss three examples of this approach: the Shift-And method and its extension to a program called agrep to handle inexact matching; the use of the Fast Fourier Transform in string matching; and the random fingerprint method of Karp and Rabin.
The Shift-And method
R. Baeza-Yates and G. Gonnet [35] devised a simple, bit-oriented method that solves the exact matching problem very efficiently for relatively small patterns (the length of a typical English word for example). They call this method the Shift-Or method, but it seems more natural to call it Shift-And. Recall that pattern P is of size n and the text T is of size m.
Definition Let M be an n by m + 1 binary-valued array, with index i running from 1 to n and index j running from 0 to m.
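In the bit-level realization, the array M is never stored whole: one machine word per text position suffices, with bit i of the current state playing the role of M(i, j). A sketch under that reading (the function and variable names are ours):

```python
def shift_and(P, T):
    """Shift-And matching: bit i of `state` records whether the prefix
    P[0..i] ends at the current text character (0-based bit numbering)."""
    n = len(P)
    # B[c] has bit i set iff P[i] == c
    B = {}
    for i, c in enumerate(P):
        B[c] = B.get(c, 0) | (1 << i)
    state, hits = 0, []
    for j, c in enumerate(T):
        # shift in a 1 (a new match may always start here), then mask by c
        state = ((state << 1) | 1) & B.get(c, 0)
        if state & (1 << (n - 1)):       # full pattern matched
            hits.append(j - n + 2)       # 1-based start position
    return hits
```

For small patterns the whole state fits in one machine word, so each text character costs only a couple of word operations; this is the source of the method's practical speed.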
In this chapter we look in detail at alignment problems in the more complex contexts typical of string problems that currently arise in computational molecular biology. These more complex problems require techniques that extend (rather than refine) the core alignment methods.
Parametric sequence alignment
Introduction
When using sequence alignment methods to study DNA or amino acid sequences, there is often considerable disagreement about how to weight matches, mismatches, insertions and deletions (indels), and gaps. The most commonly used alignment software packages require the user to specify fixed values for those parameters, and it is widely observed that the biological significance of the resulting alignment can be greatly affected by the choice of parameter settings. The following relates to alignments of proteins from the globin family and is representative of frequently seen comments in the biological literature:
…one must be able to vary the gap and gap size penalties independently and in a query dependent fashion in order to obtain the maximal sensitivity of the search.
[81]
A similar comment appears in [432]:
Sequence alignment is sensitive to the choices of gap penalty and the form of the relatedness matrix, and it is often desirable to vary these …
Finally, from [446],
One of the most prominent problems is the choice of parametric values, especially gap penalties. When very similar sequences are compared, the choice is not critical; but when the conservation is low, the resulting alignment is strongly affected.
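The sensitivity these authors describe is visible even in the simplest global-alignment recurrence with linear (per-character) gap costs; the default scores below are illustrative, not recommended settings:

```python
def align_score(s, t, match=1, mismatch=-1, indel=-2):
    """Optimal global alignment score with simple per-character costs
    (a sketch with a linear gap penalty; practical packages also
    support affine gap costs)."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * indel
    for j in range(1, n + 1):
        dp[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # align two characters
                           dp[i - 1][j] + indel,     # delete from s
                           dp[i][j - 1] + indel)     # insert into s
    return dp[m][n]
```

Changing the indel parameter alone changes the optimal score, and in general the optimal alignment itself, which is precisely why parametric analysis of these settings is of interest.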
A suffix tree is a data structure that exposes the internal structure of a string in a deeper way than does the fundamental preprocessing discussed in Section 1.3. Suffix trees can be used to solve the exact matching problem in linear time (achieving the same worst-case bound that the Knuth–Morris–Pratt and the Boyer–Moore algorithms achieve), but their real virtue comes from their use in linear-time solutions to many string problems more complex than exact matching. Moreover (as we will detail in Chapter 9), suffix trees provide a bridge between exact matching problems, the focus of Part I, and inexact matching problems that are the focus of Part III.
The classic application for suffix trees is the substring problem. One is first given a text T of length m. After O(m), or linear, preprocessing time, one must be prepared to take in any unknown string S of length n and in O(n) time either find an occurrence of S in T or determine that S is not contained in T. That is, the allowed preprocessing takes time proportional to the length of the text, but thereafter, the search for S must be done in time proportional to the length of S, independent of the length of T. These bounds are achieved with the use of a suffix tree. The suffix tree for the text is built in O(m) time during a preprocessing stage; thereafter, whenever a string S of length n is input, the algorithm searches for it in O(n) time using that suffix tree.
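The shape of those bounds can be illustrated with a naive suffix trie, which supports the O(n) query but takes quadratic rather than linear time and space to build; the linear-time suffix tree constructions keep essentially the same query logic over compressed edges. A sketch:

```python
class SuffixTrie:
    """Naive suffix structure: insert every suffix of the text into a
    trie of nested dicts. Construction is Theta(m^2) in the worst case;
    this sketch only illustrates the O(n) substring query."""

    def __init__(self, text):
        self.root = {}
        for i in range(len(text)):
            node = self.root
            for c in text[i:]:
                node = node.setdefault(c, {})

    def contains(self, s):
        """Is s a substring of the text? Time proportional to len(s)."""
        node = self.root
        for c in s:
            if c not in node:
                return False
            node = node[c]
        return True
```

Every substring of T is a prefix of some suffix, so walking s from the root succeeds exactly when s occurs in T, and the walk touches len(s) nodes regardless of the length of T.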
String search, edit, and alignment tools have been extensively used in studies of molecular evolution. However, their use has primarily been aimed at comparing strings representing single genes or single proteins. For example, evolutionary studies have usually selected a single protein and have examined how the amino acid sequence for that protein differs in different species. Accordingly, string edit and alignment algorithms have been guided by objective functions that model the most common types of mutations occurring at the level of a single gene or protein: point mutations or amino acid substitutions, single character insertions and deletions, and block insertions and deletions (gaps).
Recently, attention has been given to mutations that occur on a scale much larger than the single gene. These mutations occur at the chromosome or at the genome level and are central in the evolution of the whole genome. These larger-scale mutations have features that can be quite different from gene- or protein-level mutations. With more genome-level molecular data becoming available, larger-scale string comparisons may give insights into evolution that are not seen at the single gene or protein level.
The guiding force behind genome evolution is “duplication with modification” [126, 128, 301, 468]. That is, parts of the genome are duplicated, possibly very far away from the original site, and then modified. Other genome-level mutations of importance include inversions, where a segment of DNA is reversed; translocations, where the ends of two chromosomes (telomeres) are exchanged; and transpositions, where two adjacent segments of DNA exchange places.
Almost all discussions of exact matching begin with the naive method, and we follow this tradition. The naive method aligns the left end of P with the left end of T and then compares the characters of P and T left to right until either two unequal characters are found or until P is exhausted, in which case an occurrence of P is reported. In either case, P is then shifted one place to the right, and the comparisons are restarted from the left end of P. This process repeats until the right end of P shifts past the right end of T.
Using n to denote the length of P and m to denote the length of T, the worst-case number of comparisons made by this method is Θ(nm). In particular, if both P and T consist of the same repeated character, then there is an occurrence of P at each of the first m − n + 1 positions of T and the method performs exactly n(m − n + 1) comparisons. For example, if P = aaa and T = aaaaaaaaaa then n = 3, m = 10, and 24 comparisons are made.
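The method and its comparison count can be written out directly (a sketch that also reproduces the 24-comparison figure for the example above):

```python
def naive_match(P, T):
    """Naive left-to-right matching; returns 1-based occurrence
    positions and the total number of character comparisons made."""
    n, m = len(P), len(T)
    comparisons, occurrences = 0, []
    for s in range(m - n + 1):       # each left-end alignment of P
        i = 0
        while i < n:
            comparisons += 1
            if P[i] != T[s + i]:     # mismatch: shift P one place right
                break
            i += 1
        if i == n:                   # all n characters matched
            occurrences.append(s + 1)
    return occurrences, comparisons
```

For P = aaa and T = aaaaaaaaaa this returns occurrences at positions 1 through 8 and exactly 24 comparisons, matching the n(m − n + 1) worst-case count.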
The naive method is certainly simple to understand and program, but its worst-case running time of Θ(nm) may be unsatisfactory and can be improved. Even the practical running time of the naive method may be too slow for larger texts and patterns.
In this book I have tried to present fundamental ideas, algorithms, and techniques that have a wide range of application and that will likely remain important even as the present-day interests change. I have also tried to explain the fundamental reasons why computations on strings and sequences are productive in biology and will remain important even as the specific applications change. But with only 500 pages (a mere 285,639 words formed from 1,784,996 characters), there are certain algorithmic methods and certain present and anticipated applications that I could not cover.
Additional techniques
For additional pure computer science results on exact matching, the reader is referred to Text Algorithms by M. Crochemore and W. Rytter [117]. That book goes more deeply into several pure computer science issues, such as periodicities in strings and parallel algorithms. For a survey of many string searching algorithms and inexact matching methods, see String Searching Algorithms by G. Stephen [421]. For additional topics in computational molecular biology, particularly probabilistic and statistical questions about strings and sequences, see An Introduction to Computational Biology by M. Waterman [461]. For another introduction to combinatorial and string problems in computational molecular biology, see Introduction to Computational Molecular Biology, by J. Setubal and J. Meidanis [402]. For topics in computational molecular biology more focused on issues of protein structure, see the chapter Computational Molecular Biology by A. Lesk in [297].
A Boyer–Moore variant with a “simple” linear time bound
Apostolico and Giancarlo [26] suggested a variant of the Boyer–Moore algorithm that allows a fairly simple proof of linear worst-case running time. With this variant, no character of T will ever be compared after it is first matched with any character of P. It is then immediate that the number of comparisons is at most 2m: Every comparison is either a match or a mismatch; there can only be m mismatches since each one results in a nonzero shift of P; and there can only be m matches since no character of T is compared again after it matches a character of P. We will also show that (in addition to the time for comparisons) the time taken for all the other work in this method is linear in m.
Given the history of very difficult and partial analyses of the Boyer–Moore algorithm, it is quite amazing that a close variant of the algorithm allows a simple linear time bound. We present here a further improvement of the Apostolico–Giancarlo idea, resulting in an algorithm that simulates exactly the shifts of the Boyer–Moore algorithm. The method therefore has all the rapid shifting advantages of the Boyer–Moore method as well as a simple linear worst-case time analysis.
Key ideas
Our version of the Apostolico–Giancarlo algorithm simulates the Boyer–Moore algorithm, finding exactly the same mismatches that Boyer–Moore would find and making exactly the same shifts.
At this point we shift from the general area of exact matching and exact pattern discovery to the general area of inexact, or approximate, matching and sequence alignment. “Approximate” means that some errors, of various types detailed later, are acceptable in valid matches. “Alignment” will be given a precise meaning later, but generally means lining up the characters of strings, allowing mismatches as well as matches, and allowing characters of one string to be placed opposite spaces made in the opposing string.
We also shift from problems primarily concerning substrings to problems concerning subsequences. A subsequence differs from a substring in that the characters in a substring must be contiguous, whereas the characters in a subsequence embedded in a string need not be. For example, the string xyz is a subsequence, but not a substring, in axayaz. The shift from substrings to subsequences is a natural corollary of the shift from exact to inexact matching. This shift of focus to inexact matching and subsequence comparison is accompanied by a shift in technique. Most of the methods we will discuss in Part III, and many of the methods in Part IV, rely on the tool of dynamic programming, a tool that was not needed in Parts I and II.
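The distinction is easy to state in code (a sketch using Python's membership test on a consuming iterator):

```python
def is_subsequence(s, t):
    """True iff s occurs in t as a (not necessarily contiguous) subsequence."""
    it = iter(t)
    # each `c in it` advances the iterator past the first match of c,
    # so the matched positions in t are strictly increasing
    return all(c in it for c in s)
```

For the example above, is_subsequence("xyz", "axayaz") is True, while the substring test "xyz" in "axayaz" is False.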
Much of computational biology concerns sequence alignments
The area of approximate matching and sequence comparison is central in computational molecular biology both because of the presence of errors in molecular data and because of active mutational processes that (sub)sequence comparison methods seek to model and reveal.