In the two previous chapters we presented two models for machine translation, one based on the translation of words, and another based on the translation of phrases as atomic units. Both models were defined as mathematical formulae that, given a possible translation, assign a probabilistic score to it.
The task of decoding in machine translation is to find the best scoring translation according to these formulae. This is a hard problem, since there is an exponential number of choices, given a specific input sentence. In fact, it has been shown that the decoding problem for the presented machine translation models is NP-complete [Knight, 1999a]. In other words, exhaustively examining all possible translations, scoring them, and picking the best is computationally too expensive for an input sentence of even modest length.
In this chapter, we will present a number of techniques that make it possible to efficiently carry out the search for the best translation. These methods are called heuristic search methods. This means that there is no guarantee that they will find the best translation, but we do hope to find it often enough, or at least a translation that is very close to it.
Will decoding find a good translation for a given input? Note that there are two types of error that may prevent this. A search error is the failure to find the best translation according to the model, in other words, the highest-probability translation. A model error, in contrast, occurs when the highest-probability translation under the model is not in fact a good translation of the input; no amount of additional search effort can fix a model error.
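As a rough sketch of how such a heuristic search can be organized, consider the Python fragment below. It builds translations left to right and keeps only a fixed number of the best partial hypotheses at each step, which is exactly why search errors can occur. The helper functions phrase_options and score, and the monotone (no-reordering) search, are simplifying assumptions for this illustration, not the decoding algorithm developed in this chapter.

    # Minimal beam-search sketch (monotone, no reordering), assuming a model
    # that provides phrase_options(remaining_words) -> [(length, translation)]
    # and score(partial_translation) -> float.
    def beam_search_decode(src_words, phrase_options, score, beam_size=10):
        beam = [((), 0)]  # (translated words so far, number of source words covered)
        while any(covered < len(src_words) for _, covered in beam):
            expanded = []
            for prefix, covered in beam:
                if covered == len(src_words):
                    expanded.append((prefix, covered))  # already complete
                    continue
                # extend the hypothesis with every phrase starting at position 'covered'
                for length, translation in phrase_options(src_words[covered:]):
                    expanded.append((prefix + tuple(translation), covered + length))
            # prune: keep only the highest-scoring partial hypotheses;
            # the best full translation may be lost here (a search error)
            expanded.sort(key=lambda hyp: score(hyp[0]), reverse=True)
            beam = expanded[:beam_size]
        return max(beam, key=lambda hyp: score(hyp[0]))[0]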
This chapter will introduce concepts in statistics, probability theory, and information theory. It is not a comprehensive treatment, but will give the reader a general understanding of the principles on which the methods in this book are based.
Estimating Probability Distributions
Reasoning with probabilities in daily life is often difficult. The human mind seems best suited to dealing with certainties and clear outcomes of planned actions. Probability theory is used when outcomes are less certain and many possibilities exist.
Consider the statement in a weather report: On Monday, there is a 20% chance of rain. What does that reference to 20% chance mean? Ultimately, we only have to wait for Monday, and see if it rains or not. So, this seems to be less a statement about facts than about our knowledge of the facts. In other words, it reflects our uncertainty about the facts (in this case, the future weather).
For the human mind, dealing with probabilistic events creates complications. Faced with a 20% chance of rain, we cannot decide to carry 20% of an umbrella. We either risk carrying unnecessary weight with us or risk getting wet. So, a typical response to this piece of information is to decide that it will not rain and to ignore the less likely possibility. We do this all the time.
Over the last few centuries, machines have taken on many human tasks, and lately, with the advent of digital computers, even tasks that were thought to require thinking and intelligence. Translating between languages is one of these tasks, a task for which even humans require special training.
Machine translation has a long history, but over the last decade or two, its evolution has taken on a new direction – a direction that is mirrored in other subfields of natural language processing. This new direction is grounded in the premise that language is so rich and complex that it could never be fully analyzed and distilled into a set of rules, which are then encoded into a computer program. Instead, the new direction is to develop a machine that discovers the rules of translation automatically from a large corpus of translated text, by pairing the input and output of the translation process, and learning from the statistics over the data.
Statistical machine translation has gained tremendous momentum, both in the research community and in the commercial sector. About one thousand academic papers have been published on the subject, about half of them in the past three years alone. At the same time, statistical machine translation systems have found their way to the marketplace, ranging from the first purely statistical machine translation company, Language Weaver, to the free online systems of Google and Microsoft.
This chapter is intended for readers who have little or no background in natural language processing. We introduce basic linguistics concepts and explain their relevance to statistical machine translation. Starting with words and their linguistic properties, we move up to sentences, issues of syntax and semantics. We also discuss the role text corpora play in the building of a statistical machine translation system and the basic methods used to obtain and prepare these data sources.
Words
Intuitively, the basic atomic unit of meaning is a word. For instance, the word house evokes the mental image of a rectangular building with a roof and smoking chimney, which may be surrounded by grass and trees and inhabited by a happy family. When house is used in a text, the reader adapts this meaning based on the context (her grandmother's house is different from the White House), but we are able to claim that almost all uses of house are connected to that basic unit of meaning.
There are smaller units, such as syllables or sounds, but hou or s by itself does not evoke the mental image or meaning of house. The notion of a word seems straightforward for English readers. The text you are reading right now has its words clearly separated by spaces, so they stand out as units.
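As a first, purely illustrative step toward treating words as units, a program can simply split English text at spaces and peel off attached punctuation. The small sketch below is a naive tokenizer of this kind, not the corpus-preparation tokenization discussed later.

    import re

    def tokenize(text):
        """Naive word tokenizer: split on whitespace, then separate
        punctuation marks attached to words (e.g. 'house.' -> 'house', '.')."""
        tokens = []
        for chunk in text.split():
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
        return tokens

    print(tokenize("The White House is not her grandmother's house."))
    # ['The', 'White', 'House', 'is', 'not', 'her', 'grandmother', "'", 's', 'house', '.']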
The currently best-performing statistical machine translation systems are based on phrase-based models: models that translate small word sequences at a time. This chapter explains the basic principles of phrase-based models and how they are trained, and takes a more detailed look at extensions to the main components: the translation model and the reordering model. The next chapter will explain the algorithms that are used to translate sentences using these models.
Standard Model
First, we lay out the standard model for phrase-based statistical machine translation. While there are many variations, these can all be seen as extensions to this model.
Motivation for Phrase-Based Models
The previous chapter introduced models for machine translation that were based on the translation of words. But words may not be the best candidates for the smallest units for translation. Sometimes one word in a foreign language translates into two English words, or vice versa. Word-based models often break down in these cases.
Consider Figure 5.1, which illustrates how phrase-based models work. The German input sentence is first segmented into so-called phrases (any multiword units). Then, each phrase is translated into an English phrase. Finally, phrases may be reordered. In Figure 5.1, the six German words and eight English words are mapped as five phrase pairs.
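The following toy sketch walks through the first two of these steps, segmentation and phrase translation, with an invented phrase table. It commits greedily to a single segmentation and leaves reordering as a final swap, whereas the actual model scores many competing segmentations, translations, and reorderings.

    # A toy phrase table mapping German phrases to English phrases
    # (entries invented for illustration).
    phrase_table = {
        ("natuerlich",): "of course",
        ("hat",): "has",
        ("john",): "john",
        ("spass", "am"): "fun with the",
        ("spiel",): "game",
    }

    def translate(src_words):
        """Segment the input greedily into known phrases, translate each
        phrase, and return the English phrases (before any reordering)."""
        output, i = [], 0
        while i < len(src_words):
            # prefer the longest phrase starting at position i
            for length in range(len(src_words) - i, 0, -1):
                phrase = tuple(src_words[i:i + length])
                if phrase in phrase_table:
                    output.append(phrase_table[phrase])
                    i += length
                    break
            else:
                # unknown word: pass it through untranslated
                output.append(src_words[i])
                i += 1
        return output

    print(translate("natuerlich hat john spass am spiel".split()))
    # ['of course', 'has', 'john', 'fun with the', 'game']
    # A reordering step would then swap 'has' and 'john' to yield
    # 'of course john has fun with the game'.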
One essential component of any statistical machine translation system is the language model, which measures how likely it is that a sequence of words would be uttered by an English speaker. It is easy to see the benefits of such a model. Obviously, we want a machine translation system not only to produce output words that are true to the original in meaning, but also to string them together in fluent English sentences.
In fact, the language model typically does much more than just enable fluent output. It supports difficult decisions about word order and word translation. For instance, a probabilistic language model pLM should prefer correct word order to incorrect word order:
pLM(the house is small) > pLM(small the is house)
Formally, a language model is a function that takes an English sentence and returns the probability that it was produced by an English speaker. According to the example above, it is more likely that an English speaker would utter the sentence the house is small than the sentence small the is house. Hence, a good language model pLM assigns a higher probability to the first sentence.
This preference of the language model helps a statistical machine translation system to find the right word order. Another area where the language model aids translation is word choice. If a foreign word (such as the German Haus) has multiple translations (house, home, …), lexical translation probabilities already give preference to the more common translation (house).
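A common concrete realization of such a language model is an n-gram model estimated from a corpus. The bigram sketch below, with a made-up three-sentence corpus and no smoothing, is only meant to show how such a model comes to prefer the house is small over small the is house; it is not the language model presented later in the book.

    from collections import defaultdict
    import math

    def train_bigram_lm(sentences):
        """Estimate bigram probabilities p(w2 | w1) by relative frequency."""
        counts = defaultdict(lambda: defaultdict(int))
        for sentence in sentences:
            words = ["<s>"] + sentence.split() + ["</s>"]
            for w1, w2 in zip(words, words[1:]):
                counts[w1][w2] += 1
        return {w1: {w2: c / sum(following.values())
                     for w2, c in following.items()}
                for w1, following in counts.items()}

    def log_prob(lm, sentence):
        """Sum of log bigram probabilities; -inf for unseen bigrams (no smoothing)."""
        words = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for w1, w2 in zip(words, words[1:]):
            p = lm.get(w1, {}).get(w2, 0.0)
            total += math.log(p) if p > 0 else float("-inf")
        return total

    lm = train_bigram_lm(["the house is small", "the house is big", "small is beautiful"])
    print(log_prob(lm, "the house is small") > log_prob(lm, "small the is house"))  # True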
Constraints are everywhere: most computational problems can be described in terms of restrictions imposed on the set of possible solutions, and constraint programming is a problem-solving technique that works by incorporating those restrictions in a programming environment. It draws on methods from combinatorial optimisation and artificial intelligence, and has been successfully applied in a number of fields from scheduling, computational biology, finance, electrical engineering and operations research through to numerical analysis. This textbook for upper-division students provides a thorough and structured account of the main aspects of constraint programming. The author provides many worked examples that illustrate the usefulness and versatility of this approach to programming, as well as many exercises throughout the book that illustrate techniques, test skills and extend the text. Pointers to current research, extensive historical and bibliographic notes, and a comprehensive list of references will also be valuable to professionals in computer science and artificial intelligence.
To work effectively in information-rich environments, knowledge workers must be able to distil the most appropriate information from the deluge of information available to them. This is difficult to do manually. Natural language engineers can support these workers by developing information delivery tools, but because of the wide variety of contexts in which information is acquired and delivered, these tools have tended to be domain-specific, ad hoc solutions that are hard to generalise. This paper discusses Myriad, a platform that generalises the integration of sets of resources to a variety of information delivery contexts. Myriad provides resources from natural language generation for discourse planning as well as a service-based architecture for data access. The nature of Myriad's resources is driven by engineering concerns. It focuses on resources that reason about and generate from coarse-grained units of information, likely to be provided by existing information sources, and it supports the integration of pipe-lined planning and template mechanisms. The platform is illustrated in the context of three information delivery applications and is evaluated with respect to its utility.
This book is an investigation into the problems of generating natural language utterances to satisfy specific goals the speaker has in mind. It is thus an ambitious and significant contribution to research on language generation in artificial intelligence, which has previously concentrated in the main on the problem of translation from an internal semantic representation into the target language. Dr. Appelt's approach, based on a possible-worlds semantics of an intensional logic of knowledge and action, enables him to develop a formal representation of the effects of illocutionary acts and the speaker's beliefs about the hearer's knowledge of the world. The theory is embodied and illustrated in a computer system, KAMP (Knowledge and Modalities Planner), described in the book.
A primary problem in the area of natural language processing has been that of semantic analysis. This book aims to look at the semantics of natural languages in context. It presents an approach to the computational processing of English text that combines current theories of knowledge representation and reasoning in Artificial Intelligence with the latest linguistic views of lexical semantics. This results in distinct advantages for relating the semantic analysis of a sentence to its context. A key feature is the clear separation of the lexical entries that represent the domain-specific linguistic information from the semantic interpreter that performs the analysis. The criteria for defining the lexical entries are firmly grounded in current linguistic theories, facilitating integration with existing parsers. This approach has been implemented and tested in Prolog on a domain for physics word problems and full details of the algorithms and code are presented. Semantic Processing for Finite Domains will appeal to postgraduates and researchers in computational linguistics, and to industrial groups specializing in natural language processing.
Constraint logic programming lies at the intersection of logic programming, optimisation and artificial intelligence. It has proved a successful tool in many areas including production planning, transportation scheduling, numerical analysis and bioinformatics. Eclipse is one of the leading software systems that realise its underlying methodology. Eclipse is exploited commercially by Cisco, and is freely available and used for teaching and research in over 500 universities. This book has a two-fold purpose. First, it is an introduction to constraint programming, appropriate for one-semester courses for upper undergraduate or graduate students in computer science or for programmers wishing to master the practical aspects of constraint programming. By the end of the book, the reader will be able to understand and write constraint programs that solve complex problems. Second, it provides a systematic introduction to the Eclipse system through carefully chosen examples that guide the reader through the language and illustrate its power, versatility and utility.
The goal of identifying textual entailment – whether one piece of text can be plausibly inferred from another – has emerged in recent years as a generic core problem in natural language understanding. Work in this area has been largely driven by the PASCAL Recognizing Textual Entailment (RTE) challenges, which are a series of annual competitive meetings. The current work exhibits strong ties to some earlier lines of research, particularly automatic acquisition of paraphrases and lexical semantic relationships and unsupervised inference in applications such as question answering, information extraction and summarization. It has also opened the way to newer lines of research on more involved inference methods, on knowledge representations needed to support this natural language understanding challenge and on the use of learning methods in this context. RTE has fostered an active and growing community of researchers focused on the problem of applied entailment. This special issue of the JNLE provides an opportunity to showcase some of the most important work in this emerging area.
Those AI researchers called logicists, who favor the use of logical languages for representing knowledge and the use of logical methods for reasoning, acknowledge one problem with ordinary logic; namely, it is monotonic. By that they mean that the set of logical conclusions that can be drawn from a set of logical statements does not decrease as more statements are added to the set. If one could prove a statement from a given knowledge base, one could still prove that same statement (with the very same proof!) when more knowledge is added.
Yet, much human reasoning does not seem to work that way – a fact well noticed (and celebrated) by AI's critics. Often, we jump to a conclusion using the facts we happen to have, together with reasonable assumptions, and then have to retract that conclusion when we learn some new fact that contradicts the assumptions. That style of reasoning is called nonmonotonic or defeasible (meaning “capable of being made or declared null and void”) because new facts might require taking back something concluded before.
One can even find examples of nonmonotonic reasoning in children's stories. In That's Good! That's Bad!, by Margery Cuyler, a little boy floats high into the sky holding on to a balloon his parents bought him at the zoo. “Wow! Oh, that's good,” the story goes.
For a system to be intelligent, it must have knowledge about its world and the means to draw conclusions from, or at least act on, that knowledge. Humans and machines alike therefore must have ways to represent this needed knowledge in internal structures, whether encoded in protein or silicon. Cognitive scientists and AI researchers distinguish between two main ways in which knowledge is represented: procedural and declarative. In animals, the knowledge needed to perform a skilled action, such as hitting a tennis ball, is called procedural because it is encoded directly in the neural circuits that coordinate and control that specific action. Analogously, automatic landing systems in aircraft contain within their control programs procedural knowledge about flight paths, landing speeds, aircraft dynamics, and so on. In contrast, when we respond to a question, such as “How old are you?,” we answer with a declarative sentence, such as “I am twenty-four years old.” Any knowledge that is most naturally represented by a declarative sentence is called declarative.
In AI research (and in computer science generally), procedural knowledge is represented directly in the programs that use that knowledge, whereas declarative knowledge is represented in symbolic structures that are more-or-less separate from the many different programs that might use the information in those structures. Examples of declarative-knowledge symbol structures are those that encode logical statements (such as those McCarthy advocated for representing world knowledge) and those that encode semantic networks (such as those of Raphael or Quillian). Typically, procedural representations, specialized as they are to particular tasks, are more efficient (when performing those tasks), whereas declarative ones, which can be used by a variety of different programs, are more generally useful.
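To make the contrast concrete in program code, here is a small illustrative sketch (the names and facts are invented): the same piece of knowledge, a person's birth year, is encoded once procedurally, inside a function, and once declaratively, as a symbolic structure that any number of programs could consult.

    from datetime import date

    # Procedural: the knowledge is baked into the program that uses it.
    def age_in_years():
        # the birth year is hidden inside the procedure itself
        return date.today().year - 1985

    # Declarative: the knowledge is a symbolic structure, separate from
    # the programs that might use it.
    facts = {("alice", "born_in"): 1985, ("alice", "lives_in"): "Edinburgh"}

    def answer_age(person, facts):
        # a generic program that interprets the declarative facts
        return date.today().year - facts[(person, "born_in")]

    print(age_in_years(), answer_age("alice", facts))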