In this chapter we look at the process of mapping an abstract text specification and its constituent phrase specifications into a surface text, made up of words, punctuation symbols, and mark-up annotations. This is known as surface realisation. Much of this discussion centres on three software packages (kpml, surge, and realpro), which can often be used to carry out part of this mapping process.
Introduction
In the previous chapter, we saw how a text specification can be constructed from a document plan via the process of microplanning. The text specification provides a complete specification of the document to be generated but does not in itself constitute that document. It is a more abstract representation whose nature is suited to the kinds of manipulation required at earlier stages of the generation process. This representation needs to be mapped into a surface form, this being a sequence of words, punctuation symbols, and mark-up annotations which can be delivered to the document presentation system being used.
As described in Section 3.4, a text specification is a tree whose leaf nodes are phrase specifications and whose internal nodes show how phrases are grouped into document structures such as paragraphs and sections; internal nodes may also specify additional information about structures, such as section titles. Phrase specifications describe individual sentences; they may also describe sentence fragments in cases where these fragments are realised as orthographically separate elements (such as section titles or entries in an itemised list).
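To make the shape of this representation concrete, the sketch below shows one way such a tree might be encoded. It is a minimal illustration only; the class and field names (TextSpecNode, PhraseSpec, category, title) are our own assumptions rather than the notation of the book or of any particular system.

```python
# A minimal sketch of a text specification tree: internal nodes group
# content into document structures, leaves are phrase specifications.
# All names here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class PhraseSpec:
    """Leaf node: an abstract specification of one sentence or fragment."""
    content: dict  # e.g. {"verb": "rain", "tense": "future"}


@dataclass
class TextSpecNode:
    """Internal node: a document structure such as a section or paragraph."""
    category: str  # e.g. "section", "paragraph", "itemised-list"
    children: List[Union["TextSpecNode", PhraseSpec]] = field(default_factory=list)
    title: Optional[PhraseSpec] = None  # extra information, e.g. a section title
```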
In this chapter and the two following, we turn to details of the components that make up the nlg system architecture we introduced in Chapter 3. Our concern in this chapter is with the component we have called the document planner.
The chapter is organised as follows. In Section 4.1, we give an overview of the task of document planning, including the inputs and outputs of this process; this is largely a recap of material introduced in Chapter 3. In Section 4.2, we look at domain modelling and the related task of message definition: Here we are concerned with the process of deciding how domain information should be represented for the purposes of natural language generation. In Sections 4.3 and 4.4, we turn to the two component tasks implicated in our view of document planning, these being content determination and document structuring. In each case we describe a number of different approaches that can be taken to the tasks. In Section 4.5, we look at how these tasks can be combined architecturally within a document planning module. The chapter ends with some pointers to further reading in Section 4.6.
Introduction
What Document Planning Is About
In the nlg system architecture we presented in Chapter 3, the document planner is responsible for deciding what information to communicate (this being the task of content determination) and determining how this information should be structured for presentation (this being the task of document structuring).
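As a purely illustrative sketch of this division of labour, the fragment below separates the two tasks behind hypothetical names (Message, DocumentPlan, determine_content, structure_document); the weather-style content rules are invented for the example and this is not the design of any particular system.

```python
# Illustrative sketch: a document planner decomposed into content
# determination and document structuring. All names and rules are hypothetical.
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Message:
    """A domain-dependent chunk of information selected for communication."""
    kind: str
    data: dict


@dataclass
class DocumentPlan:
    """A tree grouping messages; here the grouping is trivially flat."""
    relation: str
    children: List[Union["DocumentPlan", Message]] = field(default_factory=list)


def determine_content(data: dict) -> List[Message]:
    """Content determination: decide which facts are worth reporting."""
    messages = []
    if data.get("rain_expected"):
        messages.append(Message("precipitation", {"type": "rain"}))
    if abs(data.get("temp_change", 0)) >= 5:
        messages.append(Message("temperature-change", {"delta": data["temp_change"]}))
    return messages


def structure_document(messages: List[Message]) -> DocumentPlan:
    """Document structuring: impose an ordering and grouping on the messages."""
    return DocumentPlan("sequence", list(messages))


def document_planner(data: dict) -> DocumentPlan:
    return structure_document(determine_content(data))


print(document_planner({"rain_expected": True, "temp_change": -6}))
```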
So far in this book we have considered the generation of natural language as if it were concerned with the production of text abstracted away from embodiment in any particular medium. This does not reflect reality, of course: When we are confronted with language, it is always embodied, whether in a speech stream, on a computer screen, on the page of a book, or on the back of a breakfast cereal packet. In this final chapter, we look beyond text generation and examine some of the issues that arise when we consider the generation of text contained within some medium.
Introduction
Linguistic content can be delivered to a reader or hearer in many ways. Consider, for example, a few of the ways in which a weather report might be presented to an audience:
Email. When delivered as the body of a simple text-only email message, it might consist of nothing more than a sequence of words and punctuation symbols, with blank lines or indentations to indicate some of the structure in the text.
Newspaper article. In this case it could include typographic elements, such as the use of bold and italic typefaces, and accompanying graphics, such as a weather map.
Web page. As well as making use of typographic devices and graphical elements, in this case it could also include hypertext links to related information, such as the weather in neighbouring cities.
Radio broadcast. Prosodic elements such as pauses and pitch changes might be used to communicate emphasis and structure.
This book describes natural language generation (nlg), which is a subfield of artificial intelligence and computational linguistics that is concerned with building computer software systems that can produce meaningful texts in English or other human languages from some underlying nonlinguistic representation of information. nlg systems use knowledge about language and the application domain to automatically produce documents, reports, help messages, and other kinds of texts.
As we enter the new millennium, work in natural language processing, and in particular natural language generation, is at an exciting stage in its development. The mid- to late 1990s have seen the emergence of the first fielded nlg applications, and the first software houses specialising in the development of nlg technology. At the time of writing, only a handful of systems are in everyday use, but many more are under development and should be fielded within the next few years. The growing interest in applications of the technology has also changed the nature of academic research in the field. More attention is now being paid to software engineering issues, and to using nlg within a wider document generation process that incorporates graphical elements and other realities of the Web-based information age such as hypertext links.
However, despite the growing interest in nlg in general and applied nlg in particular, it is often difficult for people who are not already knowledgeable in the field to obtain a comprehensive overview of what is involved in building a natural language generation system.
Natural language generation (nlg) is the subfield of artificial intelligence and computational linguistics that focuses on computer systems that can produce understandable texts in English or other human languages. Typically starting from some nonlinguistic representation of information as input, nlg systems use knowledge about language and the application domain to automatically produce documents, reports, explanations, help messages, and other kinds of texts.
nlg is both a fascinating area of research and an emerging technology with many real-world applications. As a research area, nlg brings a unique perspective on fundamental issues in artificial intelligence, cognitive science, and human–computer interaction. These include questions such as how linguistic and domain knowledge should be represented and reasoned with, what it means for a text to be well written, and how information is best communicated between machine and human. From a practical perspective, nlg technology is capable of partially automating routine document creation, removing much of the drudgery associated with such tasks. It is also being used in the research laboratory and, we expect, will soon be used in real applications, to present and explain complex information to people who do not have the background or time required to understand the raw data. In the longer term, nlg is also likely to play an important role in human–computer interfaces and will allow much richer interaction with machines than is possible today.
A piece of software as complex as a complete natural language generation system is unlikely to be constructed as a monolithic program. In this chapter, we introduce a particular architecture for nlg systems, by which we mean a specification of how the different types of processing are distributed across a number of component modules. As part of this architectural specification, we discuss how these modules interact with each other and we describe the data structures that are passed between the modules.
Introduction
Like other complex software systems, nlg systems are generally easier to build and debug if they are decomposed into distinct, well-defined, and easily integrated modules. This is especially true if the software is being developed by a team rather than by one individual. Modularisation can also make it easier to reuse components amongst different applications and to modify an application. Suppose, for example, we adopt a modularisation where one component is responsible for selecting the information content of a text and another is responsible for expressing this content in some natural language. Provided a well-defined interface between these components is specified, different teams or individuals can work on the two components independently. It may be possible to reuse the components (and in particular the second, less application-dependent component) independently of one another.
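The sketch below illustrates this idea under deliberately hypothetical names: the only coupling between the content-selection module and the linguistic-expression module is a shared ContentPlan type, so either side can be developed, replaced, or reused independently. It is a schematic illustration, not the interface of any actual nlg system.

```python
# Schematic sketch of the two-module decomposition described above.
# ContentPlan is the agreed interface between the modules; all names
# are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class ContentPlan:
    """A structured, language-neutral selection of content."""
    facts: List[dict]


class ContentSelector(Protocol):
    def select(self, domain_data: dict) -> ContentPlan: ...


class LinguisticRealiser(Protocol):
    def express(self, plan: ContentPlan) -> str: ...


def generate(selector: ContentSelector, realiser: LinguisticRealiser,
             domain_data: dict) -> str:
    # Either module can be swapped out independently, provided it
    # respects the ContentPlan interface.
    return realiser.express(selector.select(domain_data))
```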
This paper focuses on the design methodology of the MultiLingual Dictionary-System (MLDS), a human-oriented tool for assisting translators in the task of translating lexical units, conceived on the basis of studies carried out with translators. We describe the model adopted for representing multilingual dictionary knowledge. This model allows an enriched exploitation of the lexical-semantic relations extracted from dictionaries. In addition, MLDS is supplied with knowledge about how dictionaries are used in the process of lexical translation, elicited by means of empirical methods and specified in a formal language. The dictionary knowledge, together with this task-oriented knowledge, is used to offer the translator active, anticipatory, and intelligent assistance.
This paper describes an approach for constructing a mixture of language models based on simple statistical notions of semantics using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and is able to model varying styles of text. Using such information, the corpus texts are clustered in an unsupervised manner and a mixture of topic-specific language models is automatically created. The principal contribution of this work is to characterise the document space resulting from information retrieval techniques and to demonstrate the approach for mixture language modelling. A comparison is made between manual and automatic clustering in order to elucidate how the global content information is expressed in the space. We also compare (in terms of association with manual clustering and language modelling accuracy) alternative term-weighting schemes and the effect of singular value decomposition dimension reduction (latent semantic analysis). Test set perplexity results using the British National Corpus indicate that the approach can improve the potential of statistical language modelling. Using an adaptive procedure, the conventional model may be tuned to track text data with a slight increase in computational cost.
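For reference, a topic-based mixture language model of this general kind is usually written as an interpolation of its K topic-specific component models; the formulation below is the standard one and may differ in detail from the model used in the paper:

$$P(w_i \mid h_i) \;=\; \sum_{k=1}^{K} \lambda_k\, P_k(w_i \mid h_i), \qquad \sum_{k=1}^{K} \lambda_k = 1, \quad \lambda_k \ge 0,$$

where $h_i$ is the history of word $w_i$, each $P_k$ is the language model trained on the $k$-th (automatically derived) cluster, and the mixture weights $\lambda_k$ can be re-estimated adaptively as text is processed. Test-set perplexity is then $2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 P(w_i \mid h_i)}$.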
Treebanks, such as the Penn Treebank, provide a basis for the automatic creation of broad coverage grammars. In the simplest case, rules can simply be ‘read off’ the parse-annotations of the corpus, producing either a simple or probabilistic context-free grammar. Such grammars, however, can be very large, presenting problems for the subsequent computational costs of parsing under the grammar. In this paper, we explore ways by which a treebank grammar can be reduced in size or ‘compacted’, which involve the use of two kinds of technique: (i) thresholding of rules by their number of occurrences; and (ii) a method of rule-parsing, which has both probabilistic and non-probabilistic variants. Our results show that by a combined use of these two techniques, a probabilistic context-free grammar can be reduced in size by 62% without any loss in parsing performance, and by 71% to give a gain in recall, but some loss in precision.
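To illustrate the first of these techniques, the fragment below sketches frequency thresholding of treebank rules; the rule encoding and the threshold value are assumptions made for the example and are not taken from the paper.

```python
# Illustrative sketch of compacting a treebank grammar by frequency
# thresholding: rules read off the corpus fewer than `min_count` times
# are discarded. Data formats here are assumptions for the example.
from collections import Counter
from typing import List, Tuple

# A grammar rule read off a treebank parse: (LHS, (RHS symbols...)).
Rule = Tuple[str, Tuple[str, ...]]


def threshold_rules(treebank_rules: List[Rule], min_count: int = 2) -> List[Rule]:
    """Keep only rules that occur at least `min_count` times in the treebank."""
    counts = Counter(treebank_rules)
    return [rule for rule, n in counts.items() if n >= min_count]


# Toy usage: the rare rule NP -> NP SBAR NP is dropped at threshold 2.
rules = [("NP", ("DT", "NN"))] * 40 + [("NP", ("NP", "SBAR", "NP"))]
print(threshold_rules(rules))  # [('NP', ('DT', 'NN'))]
```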
We show how the DOP model can be used for fast and robust context-sensitive processing of spoken input in a practical spoken dialogue system called OVIS. OVIS (Openbaar Vervoer Informatie Systeem, 'Public Transport Information System') is a Dutch spoken-language information system which operates over ordinary telephone lines. The prototype system is the immediate goal of the NWO Priority Programme 'Language and Speech Technology'. In this paper, we extend the original Data-Oriented Parsing (DOP) model to context-sensitive interpretation of spoken input. The system we describe uses the OVIS corpus (which consists of 10,000 trees enriched with compositional semantics) to compute, from an input word-graph, the best utterance together with its meaning. Dialogue context is taken into account by dividing the OVIS corpus into context-dependent subcorpora. Each system question triggers a subcorpus with which the user's answer is analysed and interpreted. Our experiments indicate that the context-sensitive DOP model obtains better accuracy than the original model, allowing for fast and robust processing of spoken input.
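The context-sensitive step can be pictured schematically as follows; the field names and question labels are illustrative assumptions, and the DOP analysis of the answer word-graph itself is omitted.

```python
# Schematic sketch: partition a corpus into context-dependent subcorpora
# keyed by system question type, then select the subcorpus triggered by
# the current question. Field names are illustrative assumptions.
from typing import Dict, List


def partition_corpus(annotated_utterances: List[dict]) -> Dict[str, List[dict]]:
    """Group annotated corpus utterances by the system question they answer."""
    subcorpora: Dict[str, List[dict]] = {}
    for utt in annotated_utterances:
        subcorpora.setdefault(utt["question_type"], []).append(utt)
    return subcorpora


def select_subcorpus(question_type: str,
                     subcorpora: Dict[str, List[dict]]) -> List[dict]:
    """Pick the subcorpus triggered by the current system question; the
    analysis of the answer word-graph (not shown) would then draw its
    fragments from this subcorpus only."""
    return subcorpora.get(question_type, [])
```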
Most parsing algorithms require phrases that are to be combined to be either contiguous or marked as being ‘extraposed’. The assumption that phrases which are to be combined will be adjacent to one another supports rapid indexing mechanisms: the fact that in most languages items can turn up in unexpected locations cancels out much of the ensuing efficiency. The current paper shows how ‘out of position’ items can be incorporated directly. This leads to efficient parsing even when items turn up having been right-shifted, a state of affairs which makes Johnson and Kay's (1994) notion of ‘sponsorship’ of empty nodes inapplicable.
In this paper we present results concerning the large scale automatic extraction of pragmatic content from email by a system based on a phrase matching approach to speech act detection combined with empirical detection of speech act patterns in corpora. The results show that most speech acts that occur in this corpus can be recognized by the approach. This investigation is supported by analysis of a corpus consisting of 1000 emails.
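As an illustration of what a phrase-matching approach to speech act detection can look like, the sketch below matches a handful of cue-phrase patterns against a message body; the speech act labels and patterns are invented for the example and are not the inventory or patterns used in the paper.

```python
# Minimal sketch of phrase matching for speech act detection in email.
# The speech act labels and cue phrases are illustrative assumptions.
import re
from typing import List, Tuple

SPEECH_ACT_PATTERNS: List[Tuple[str, re.Pattern]] = [
    ("request", re.compile(r"\b(could you|please|would you mind)\b", re.I)),
    ("commit", re.compile(r"\b(i will|i'll|i shall)\b", re.I)),
    ("question", re.compile(r"\b(what|when|where|who|how)\b.*\?", re.I)),
]


def detect_speech_acts(email_body: str) -> List[str]:
    """Return the speech acts whose cue phrases match the message text."""
    return [act for act, pattern in SPEECH_ACT_PATTERNS
            if pattern.search(email_body)]


print(detect_speech_acts("Could you please send the report? I'll review it."))
# ['request', 'commit']
```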