Over the last few centuries, machines have taken on many human tasks, and lately, with the advent of digital computers, even tasks that were thought to require thinking and intelligence. Translating between languages is one of these tasks, a task for which even humans require special training.
Machine translation has a long history, but over the last decade or two, its evolution has taken on a new direction – a direction that is mirrored in other subfields of natural language processing. This new direction is grounded in the premise that language is so rich and complex that it could never be fully analyzed and distilled into a set of rules, which are then encoded into a computer program. Instead, the new direction is to develop a machine that discovers the rules of translation automatically from a large corpus of translated text, by pairing the input and output of the translation process, and learning from the statistics over the data.
Statistical machine translation has gained tremendous momentum, both in the research community and in the commercial sector. About one thousand academic papers have been published on the subject, about half of them in the past three years alone. At the same time, statistical machine translation systems have found their way to the marketplace, ranging from the first purely statistical machine translation company, Language Weaver, to the free online systems of Google and Microsoft.
This chapter is intended for readers who have little or no background in natural language processing. We introduce basic linguistic concepts and explain their relevance to statistical machine translation. Starting with words and their linguistic properties, we move up to sentences and to issues of syntax and semantics. We also discuss the role text corpora play in building a statistical machine translation system and the basic methods used to obtain and prepare these data sources.
Words
Intuitively, the basic atomic unit of meaning is a word. For instance, the word house evokes the mental image of a rectangular building with a roof and smoking chimney, which may be surrounded by grass and trees and inhabited by a happy family. When house is used in a text, the reader adapts this meaning based on the context (her grandmother's house is different from the White House), but we are able to claim that almost all uses of house are connected to that basic unit of meaning.
There are smaller units, such as syllables or sounds, but a syllable like hou or a sound like s does not evoke the mental image or meaning of house. The notion of a word seems straightforward to English readers: the text you are reading right now has its words clearly separated by spaces, so they stand out as units.
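Since English marks word boundaries with spaces, the simplest way to recover word units from raw text is to split on whitespace. A minimal sketch (real systems also handle punctuation, which this toy tokenizer ignores):

```python
def tokenize(text: str) -> list[str]:
    """Split a sentence into word units on whitespace.

    This relies on the property noted above: English text
    separates its words with spaces.
    """
    return text.split()


tokens = tokenize("the house is small")
print(tokens)  # ['the', 'house', 'is', 'small']
```

For languages without explicit word separators (e.g. Chinese), segmentation requires considerably more machinery than this.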
The best-performing statistical machine translation systems currently in use are based on phrase-based models: models that translate short sequences of words at a time. This chapter explains the basic principles of phrase-based models and how they are trained, and takes a more detailed look at extensions to the main components: the translation model and the reordering model. The next chapter explains the algorithms used to translate sentences with these models.
Standard Model
First, we lay out the standard model for phrase-based statistical machine translation. While there are many variations, these can all be seen as extensions to this model.
Motivation for Phrase-Based Models
The previous chapter introduced models for machine translation that were based on the translation of words. But words may not be the best candidates for the smallest units for translation. Sometimes one word in a foreign language translates into two English words, or vice versa. Word-based models often break down in these cases.
Consider Figure 5.1, which illustrates how phrase-based models work. The German input sentence is first segmented into so-called phrases (any multiword units). Each phrase is then translated into an English phrase. Finally, phrases may be reordered. In Figure 5.1, the six German words and eight English words are mapped as five phrase pairs.
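The three steps (segment, translate, reorder) can be sketched in a few lines of Python. The phrase table `PHRASE_TABLE`, the segmentation, and the reordering below are illustrative assumptions for one German sentence, not the actual contents of Figure 5.1 or of a trained model:

```python
# Hypothetical phrase table: German phrase -> English phrase.
PHRASE_TABLE = {
    ("natuerlich",): ("of", "course"),
    ("hat",): ("has",),
    ("john",): ("john",),
    ("spass", "am"): ("fun", "with", "the"),
    ("spiel",): ("game",),
}


def translate(segments, order):
    """Translate each source phrase, then emit phrases in target order.

    segments: the source sentence already split into phrases.
    order: a permutation of phrase indices (the reordering step).
    """
    translated = [PHRASE_TABLE[tuple(seg)] for seg in segments]
    return [word for i in order for word in translated[i]]


# Step 1: segment the German input into five phrases.
segments = [["natuerlich"], ["hat"], ["john"], ["spass", "am"], ["spiel"]]
# Steps 2-3: translate each phrase; swap 'hat' and 'john' so the
# English subject precedes the verb.
print(" ".join(translate(segments, order=[0, 2, 1, 3, 4])))
# of course john has fun with the game
```

Note how six German words become eight English words via five phrase pairs, with no single phrase pair forced into a one-word-to-one-word mapping.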
One essential component of any statistical machine translation system is the language model, which measures how likely it is that a sequence of words would be uttered by an English speaker. It is easy to see the benefits of such a model. Obviously, we want a machine translation system not only to produce output words that are true to the original in meaning, but also to string them together in fluent English sentences.
In fact, the language model typically does much more than just enable fluent output. It supports difficult decisions about word order and word translation. For instance, a probabilistic language model pLM should prefer correct word order to incorrect word order:
pLM(the house is small) > pLM(small the is house)
Formally, a language model is a function that takes an English sentence and returns the probability that it was produced by an English speaker. According to the example above, it is more likely that an English speaker would utter the sentence the house is small than the sentence small the is house. Hence, a good language model pLM assigns a higher probability to the first sentence.
This preference of the language model helps a statistical machine translation system to find the right word order. Another area where the language model aids translation is word choice. If a foreign word (such as the German Haus) has multiple translations (house, home, …), lexical translation probabilities already give preference to the more common translation (house).
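A minimal sketch of such a language model, using bigram counts with add-alpha smoothing, shows the preference pLM(the house is small) > pLM(small the is house). The three-sentence training corpus here is an illustrative assumption; real systems train on large corpora and use higher-order n-grams with more sophisticated smoothing:

```python
from collections import Counter

# Tiny illustrative training corpus, with sentence boundary markers.
corpus = [
    "<s> the house is small </s>",
    "<s> the house is big </s>",
    "<s> the home is small </s>",
]

unigrams = Counter()
bigrams = Counter()
for line in corpus:
    words = line.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))


def p_lm(sentence, alpha=0.1):
    """Bigram sentence probability with add-alpha smoothing."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    vocab_size = len(unigrams)
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= (bigrams[(prev, word)] + alpha) / (
            unigrams[prev] + alpha * vocab_size
        )
    return prob


# The fluent word order gets the higher probability.
assert p_lm("the house is small") > p_lm("small the is house")
```

The same mechanism aids word choice: because house is more frequent than home after the in the training data, the model scores the house is small above the home is small, nudging the system toward the more common translation of Haus in this context.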
This book is an investigation into the problems of generating natural language utterances to satisfy specific goals the speaker has in mind. It is thus an ambitious and significant contribution to research on language generation in artificial intelligence, which has previously concentrated in the main on the problem of translation from an internal semantic representation into the target language. Dr. Appelt's approach, based on a possible-worlds semantics of an intensional logic of knowledge and action, enables him to develop a formal representation of the effects of illocutionary acts and the speaker's beliefs about the hearer's knowledge of the world. The theory is embodied and illustrated in a computer system, KAMP (Knowledge and Modalities Planner), described in the book.
In this wide-ranging study, Neil Lazarus explores the subject of cultural practice in the modern world system. The book contains individual chapters on a range of topics from modernity, globalization and the 'West', and nationalism and decolonization, to cricket and popular consciousness in the English-speaking Caribbean. Lazarus analyses social movements, ideas and cultural practices that have migrated from the 'First world' to the 'Third world' over the course of the twentieth century. Nationalism and Cultural Practice in the Postcolonial World offers an enormously erudite reading of culture and society in today's world and includes extended discussion of the work of such influential writers, critics and activists as Frantz Fanon, C. L. R. James, Edward Said, Gayatri Spivak, Samir Amin, Raymond Williams, Paul Gilroy and Partha Chatterjee. This book is a politically focused, materialist intervention into postcolonial and cultural studies, and constitutes a major reappraisal of the debates on politics and culture in these fields.
A primary problem in the area of natural language processing has been that of semantic analysis. This book aims to look at the semantics of natural languages in context. It presents an approach to the computational processing of English text that combines current theories of knowledge representation and reasoning in Artificial Intelligence with the latest linguistic views of lexical semantics. This results in distinct advantages for relating the semantic analysis of a sentence to its context. A key feature is the clear separation of the lexical entries that represent the domain-specific linguistic information from the semantic interpreter that performs the analysis. The criteria for defining the lexical entries are firmly grounded in current linguistic theories, facilitating integration with existing parsers. This approach has been implemented and tested in Prolog on a domain for physics word problems and full details of the algorithms and code are presented. Semantic Processing for Finite Domains will appeal to postgraduates and researchers in computational linguistics, and to industrial groups specializing in natural language processing.
The Old English dialogues of Solomon and Saturn edited here survive in two manuscripts. Cambridge, Corpus Christi College 422, Part A (A) contains a dialogue between Solomon and Saturn on the Pater Noster in verse (SolSatI) and prose (SolSatPNPr), followed by a poetical dialogue on a range of subjects (SolSatII); at the head of p. 13 is a fragment of verse (SolSatFrag). Cambridge, Corpus Christi College 41 (B) contains the first 95 lines of SolSatI. In A SolSatI begins on the badly damaged and mostly unreadable p. 1, continuing to p. 6, line 12, where without change of manuscript format, the dialogue continues in prose; this proceeds as far as the bottom of p. 12, where it terminates abruptly owing to the loss of a leaf. On the top of p. 13 (recto of the last leaf of the first quire) are seven lines of text in verse (nine edited lines), which are clearly the conclusion of a dialogue, though whether these were originally designed to conclude SolSatI or SolSatII has proved a subject of debate. Following these lines on p. 13 a second verse dialogue begins, continuing to p. 26, where it ends incomplete. In B the first part of SolSatI (as far as line 94a) is found written in the margins of pp. 196–8 of the Old English translation of Bede’s Historia Ecclesiastica.
A (CCCC 422; Part A)
CCCC 422 (Ker, no. 70; Gneuss nos. 110, 111) comprises two distinct parts. The larger of these is Part B, a missal; its calendar dates this part to c.1060. The two quires containing the dialogues that make up Part A are in a unique hand and were apparently used by a medieval binder as flyleaves for the missal. These leaves are now bound together at the front of CCCC 422, and their Parkerian pagination, pp. 1–26, points to rearrangement in the sixteenth century. The leaves of Part A are difficult to read in places, owing to rubbing and the application of reagents. Page 1, which was long an outside page, is barely legible, though Page’s examination of the leaf under ultraviolet light has confirmed that those parts of the text which can be seen substantially agree with B.