In the previous chapter, we made some suggestions about a schema theory of language acquisition by the child, for whom language is initially an external reality to be mastered. We also noted that the external language is itself a social construction – a “collective representation,” as Durkheim called it. In this section, and more particularly in Chapter 10, we start to build some bridges between the essentially individualist approach and a more comprehensive picture of language as also embodying the constructions and classifications of a culture. In this area there are plenty of empirical studies on which to draw, deriving from literary criticism, social anthropology, and the history of ideas and of science, as well as Wittgensteinian philosophy. However, cognitive science is as yet an infant in these worlds, and we cannot claim to have more than hints for an adequate theory of such social schemas. In this section, we concentrate not on the empirical aspects of sociolinguistics but on the philosophical implications of our theory of language so far.
Our emphasis on the dynamics of meaning change and its holistic character already brings us into conflict with a long-entrenched philosophical tradition. Speaking of the suggestion that the meaning of words changes whenever our mental state changes, when, for example, we acquire more knowledge about the subject matter, Putnam (1981, p. 22n) says this “would not allow any words to ever have the same meaning, and would thus amount to an abandonment of the very notion of the word ‘meaning’.”
Natural language is an integral part of our lives. Language serves as the primary vehicle by which people communicate and record information. It has the potential for expressing an enormous range of ideas, and for conveying complex thoughts succinctly. Because it is so integral to our lives, however, we usually take its powers and influence for granted.
The aim of computational linguistics is, in a sense, to capture this power. By understanding language processes in procedural terms, we can give computer systems the ability to generate and interpret natural language. This would make it possible for computers to perform linguistic tasks (such as translation), process textual data (books, journals, newspapers), and make it much easier for people to access computer-stored data. A well-developed ability to handle language would have a profound impact on how computers are used.
The potential for natural language processing was recognized quite early in the development of computers, and work in computational linguistics – primarily for machine translation – began in the 1950s at a number of research centers. The rapid growth in the field, however, has taken place mostly since the late 1970s. A 1983 survey by the Association for Computational Linguistics (Evens and Karttunen 1983) listed 85 universities granting degrees in computational linguistics. A 1982 survey by the ACM (Association for Computing Machinery) Special Interest Group on Artificial Intelligence (Kaplan 1982) listed 59 university and research centers with projects in computational linguistics, and the number continues to grow.
Computational linguistics is the study of computer systems for understanding and generating natural language. In this volume we shall be particularly interested in the structure of such systems, and the design of algorithms for the various components of such systems.
Why should we be interested in such systems? Although the objectives of research in computational linguistics are widely varied, a primary motivation has always been the development of specific practical systems which involve natural language. Three classes of applications which have been central in the development of computational linguistics are
Machine translation. Work on machine translation began in the late 1950s with high hopes and little realization of the difficulties involved. Problems in machine translation stimulated work in both linguistics and computational linguistics, including some of the earliest parsers. Extensive work was done in the early 1960s, but a lack of success, and in particular a realization that fully-automatic high-quality translation would not be possible without fundamental work on text ‘understanding’, led to a cutback in funding. Only a few of the current projects in computational linguistics in the United States are addressed toward machine translation, although there are substantial projects in Europe and Japan (Slocum 1984, 1985; Tucker 1984).
Information retrieval. Because so much of the information we use appears in natural language form – books, journals, reports – another application in which interest developed was automatic information retrieval from natural language texts. […]
Up to now, we have restricted ourselves to determining the structure and meaning of individual sentences. Although we have used limited extrasentential information (for anaphora resolution), we have not examined the structure of entire texts. Yet the information conveyed by a text is clearly more than the sum of its parts – more than the meanings of its individual sentences. If a text tells a story, describes a procedure, or offers an argument, we must understand the connections between the component sentences in order to have fully understood the story. These connections are needed both per se (to answer questions about why an event occurred, for example) and to resolve ambiguities in the meanings of individual sentences. Discourse analysis is the study of these connections. Because these connections are usually implicit in the text, identifying them may be a difficult task.
As a simple example of the problems we face, consider the following brief description of a naval encounter:
Just before dawn, the Valiant sighted the Zwiebel and fired two torpedoes. It sank swiftly, leaving few survivors.
The most evident linguistic problem we face is finding an antecedent for ‘it’. There are four candidates in the first sentence: ‘dawn’, ‘Valiant’, ‘Zwiebel’, and ‘torpedoes’. Semantic classification should enable us to exclude ‘dawn’ (*‘dawn sinks’), and number agreement will exclude ‘torpedoes’, but that still leaves us with two candidates: ‘the Valiant’ and ‘the Zwiebel’ (which are presumably both ships of some sort).
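A minimal Python sketch of the two filters just mentioned, semantic classification and number agreement, may make the process concrete. The lexicon entries, feature names, and selectional restriction below are assumptions invented for the illustration, not part of any particular system.

```python
# Illustrative candidate filtering for pronoun resolution.
# The feature values below are assumptions made for this example.

CANDIDATES = [
    {"word": "dawn",      "semantic_class": "time",   "number": "singular"},
    {"word": "Valiant",   "semantic_class": "ship",   "number": "singular"},
    {"word": "Zwiebel",   "semantic_class": "ship",   "number": "singular"},
    {"word": "torpedoes", "semantic_class": "weapon", "number": "plural"},
]

def filter_candidates(pronoun_number, verb_selects):
    """Keep only antecedents that agree in number with the pronoun and
    satisfy the verb's (assumed) selectional restriction."""
    return [
        c["word"]
        for c in CANDIDATES
        if c["number"] == pronoun_number and c["semantic_class"] in verb_selects
    ]

# 'It sank': singular pronoun, and 'sink' is assumed to select ships
# (thereby excluding 'dawn').
print(filter_candidates("singular", {"ship"}))   # ['Valiant', 'Zwiebel']
```

As the filtered list shows, both sentence-level constraints still leave two candidates; choosing between the Valiant and the Zwiebel requires exactly the kind of discourse-level connection (who fired at whom, and what firing torpedoes leads to) that this chapter is about.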
As we noted in the first chapter, language generation has generally taken second place to language analysis in computational linguistics research. This imbalance reflects a basic property of language, namely, that there are many ways of saying the same thing. In order for a natural language interface to be fluent, it should be able to accept most possible paraphrases of the information or commands the user wishes to transmit. On the other hand, it will suffice to generate one form of each message the system wishes to convey to the user.
As a result, many systems have combined sophisticated language analysis procedures with rudimentary generation components. Often generation involves nothing more than ‘filling in the blanks’ in a set of predefined message formats. This has been adequate for the simple messages many systems need to express: values retrieved from a data base, error messages, instructions to the user.
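As an illustration of the 'filling in the blanks' approach, here is a hedged sketch in Python; the message formats and slot names are hypothetical, not those of any system discussed here.

```python
# A hypothetical set of predefined message formats; generation amounts to
# filling in the blanks with values computed by the rest of the system.
TEMPLATES = {
    "retrieval":   "The {attribute} of {entity} is {value}.",
    "error":       "Sorry, I do not know the word '{word}'.",
    "instruction": "Please rephrase your question as a complete sentence.",
}

def generate(message_type, **slots):
    """Produce one surface form per message type by slot filling."""
    return TEMPLATES[message_type].format(**slots)

print(generate("retrieval", attribute="department", entity="J. Smith",
               value="shipping"))
print(generate("error", word="emolument"))
```

The point of the sketch is the asymmetry noted above: a single fixed form per message suffices for output, whereas the analysis side must cope with many paraphrases of each input.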
More sophisticated systems, however, have more complex messages to convey. People querying a data base in natural language often begin by asking about the structure or general content of the data base rather than asking for specific data values (Malhotra 1975); we would like to extend natural language data base interfaces so that they can answer such questions. For systems employing lengthy sequences of inferences, such as those for medical diagnosis (e.g., Shortliffe 1976), user acceptance and system improvement depend critically on the ability of the system to explain its reasoning.
Syntax analysis performs two main functions in analyzing natural language input:
Determining the structure of the input. In particular, syntax analysis should identify the subject and objects of each verb and determine what each modifying word or phrase modifies. This is most often done by assigning a tree structure to the input, in a process referred to as parsing.
Regularizing the syntactic structure. Subsequent processing (i.e., semantic analysis) can be simplified if we map the large number of possible input structures into a smaller number of structures. For example, some material in sentences (enclosed in brackets in the examples below) can be omitted or ‘zeroed’:
John ate cake and Mary [ate] cookies.
… five or more [than five] radishes …
He talks faster than John [talks].
Sentence structure can be regularized by restoring such zeroed information. Other transformations can relate sentences with normal word order (‘I crushed those grapes. That I like wine is evident.’) to passive (‘Those grapes were crushed by me.’) and cleft (‘It is evident that I like wine.’) constructions, and can relate nominal (‘the barbarians' destruction of Rome’) and verbal (‘the barbarians destroyed Rome’) constructions. Such transformations will permit subsequent processing to concern itself with a much smaller number of structures. […]
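As a concrete, deliberately over-simplified sketch of both functions described above – a tree structure assigned by parsing and a regularizing transformation applied to it – the following Python fragment maps a passive tree onto the canonical form its active counterpart would receive. The tree encoding and the rule are assumptions made for the illustration, not the representation used by any particular parser.

```python
# A toy constituent tree: (label, children), with words as leaves.
tree = ("S",
        [("NP", ["those", "grapes"]),
         ("VP", [("V", ["were", "crushed"]),
                 ("PP", ["by", "me"])])])

def regularize_passive(tree):
    """Map a passive structure onto the (verb, logical-subject, logical-object)
    form an active sentence would receive, so that later stages deal with a
    single canonical shape.  (Pronoun case, 'me' vs. 'I', is ignored here.)"""
    _, (subject_np, verb_phrase) = tree
    _, np_words = subject_np
    _, ((_, verb_words), (_, by_phrase)) = verb_phrase
    return (verb_words[-1], " ".join(by_phrase[1:]), " ".join(np_words))

# 'Those grapes were crushed by me.' regularizes to the same
# operator-operand form as 'I crushed those grapes.'
print(regularize_passive(tree))   # ('crushed', 'me', 'those grapes')
```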
What is the objective of semantic analysis? We could say that it is to determine what a sentence means, but by itself this is not a very helpful answer. It may be more enlightening to say that, for declarative sentences, semantics seeks to determine the conditions under which a sentence is true or, almost equivalently, what the inference rules are among sentences of the language. Characterizing the semantics of questions and imperatives is a bit more problematic, but we can see the connection with declaratives by noting that, roughly speaking, questions are requests to be told whether a sentence is true (or to be told the values for which a certain sentence is true) and imperatives are requests to make a sentence true.
People who study natural language semantics find it desirable (or even necessary) to define a formal language with a simple semantics, thus changing the problem to one of determining the mapping from natural language into this formal language. What properties should this formal language have (which natural language does not)? It should
* be unambiguous
* have simple rules of interpretation and inference, and in particular
* have a logical structure determined by the form of the sentence
We shall examine some such languages, the languages of the various logics, shortly.
Of course, when we build a practical natural language system our interest is generally not just finding out if sentences are true or false.
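Even so, the truth-conditional view sketched above can be made concrete. The following is a minimal Python illustration only: a declarative sentence is translated into a predicate-argument formula whose truth conditions are evaluated against a toy model; the model and predicate names are invented for the example and do not reproduce any of the logics examined later.

```python
# A toy 'world model' against which translated formulas are evaluated.
# Everything in it is an assumption made up for the illustration.
WORLD = {
    ("crushed", "I", "those grapes"),
    ("likes", "I", "wine"),
}

def holds(predicate, *args):
    """Truth conditions for an atomic formula: it is true iff the
    corresponding tuple is in the model."""
    return (predicate, *args) in WORLD

# 'I crushed those grapes.'  ->  crushed(I, those grapes)
print(holds("crushed", "I", "those grapes"))   # True

# A yes/no question is, roughly, a request to evaluate the same formula;
# an imperative is a request to make it true.
print(holds("likes", "I", "beer"))             # False
```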
This paper starts by tracing the architecture of document preparation systems. Two basic types of document representations appear: at the page level or at the logical level. The paper then focuses on logical-level representations and surveys three existing formalisms: SGML, Interscript and ODA.
Introduction
Document preparation systems may now be the most commonly used computer systems, ranging from stand-alone text-processing machines for individual use to highly sophisticated systems running on mainframe computers. All of these systems internally use a more or less formal system for representing documents. Document representation formalisms differ widely according to their goals. Some define the interface with the printing device; they are oriented towards a precise geometric description of the contents of each page in a document. Others are used internally as in-memory representations. Yet others have to be learned by users; they are symbolic languages used to control document processing.
The trouble is that today there are nearly as many representation formalisms as there are document preparation systems. This makes it nearly impossible, first, to interchange documents among heterogeneous systems and, second, to have standard programming interfaces for developing systems. Standardization organizations and large companies are now trying to establish standards in the field in order to stop the proliferation of formalisms and facilitate document interchange.
The last sections of this paper focus on three document representation formalisms often called ‘revisable formats’, namely SGML [SGML], ODA [ODA], and Interscript [Ayers & al.], [Joloboff & al.]. To better understand what a revisable format is, the paper starts with a look at the evolution of the architecture of document preparation systems.
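The distinction between the two levels can be illustrated with a schematic Python sketch; it does not reproduce the actual syntax of SGML, ODA, or Interscript, and the element names and coordinates are invented for the example.

```python
# Logical-level representation: the document as a hierarchy of named
# structural elements, with no geometric information.
logical = ("report",
           [("title", "Revisable formats"),
            ("section",
             [("heading", "Introduction"),
              ("paragraph", "Document preparation systems ...")])])

# Page-level representation: the same content as positioned boxes on a
# page; coordinates and dimensions are in arbitrary units.
page = [
    {"page": 1, "x": 72, "y": 720, "width": 451, "text": "Revisable formats"},
    {"page": 1, "x": 72, "y": 680, "width": 451, "text": "Introduction"},
    {"page": 1, "x": 72, "y": 640, "width": 451,
     "text": "Document preparation systems ..."},
]

# A formatter maps the logical level onto the page level; a 'revisable
# format' keeps the logical level so the document can still be edited
# and reformatted after interchange.
```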
The paper presents the design of a document preparation system that allows users to make use of existing batch formatters and yet provides an interactive user interface with what-you-see-is-almost-what-you-get feedback.
Introduction
Increasing numbers of people are using computers for the preparation of documents. Many of these new computer users are not “computer types”; they have a problem (to produce a neatly formatted document), they know the computer can help them, and they want the result with a minimum of (perceived) fuss and bother. The terms in which they present the problem to the computer should be “theirs” – easy for them to use and understand and based on previous document experience.
Many powerful document preparation tools exist that are capable of producing high quality output. However, they are often awkward (some would say difficult) to use, especially for the novice or casual user, and a substantial amount of training is usually necessary before they can be used intelligently.
This paper presents the design of a document preparation system that allows users to make use of existing formatters and yet makes document entry relatively easy. The following topics are discussed:
the requirements and overall design for such a system, and
some of the issues to be resolved in constructing the system.
First, some terminology is clarified.
Terms and Concepts
We use Shaw's model for documents [Shaw80, Furuta82, Kimura84]. A document is viewed as a hierarchy of objects, where each object is an instance of a class that defines the possible components and other attributes of its instances. Typical (low-level) classes are document components such as sections, paragraphs, headings, footnotes, figures, and tables.
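A small Python sketch of this object/class view of documents may help; it is an illustration of the model under stated assumptions, not the data structures of the system described in this paper, and the class definitions are invented for the example.

```python
# Each class names the components its instances may contain; a document
# instance is a tree whose nodes are checked against their class.
CLASSES = {
    "document":  ["section"],
    "section":   ["heading", "paragraph", "figure", "footnote"],
    "heading":   [],
    "paragraph": [],
    "figure":    [],
    "footnote":  [],
}

def well_formed(node):
    """node = (class_name, children). Check that every child belongs to a
    class permitted by its parent's class definition."""
    cls, children = node
    return all(child[0] in CLASSES[cls] and well_formed(child)
               for child in children)

doc = ("document",
       [("section",
         [("heading", []), ("paragraph", []), ("figure", [])])])
print(well_formed(doc))   # True
```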
For many years text preparation and document manipulation have been poor relations in the computing world, and it is only recently that they have taken their rightful place in the mainstream of computer research and development. Everyone has their own favourite reason for this change: word processors, workstations with graphics screens, nonimpact printers, or authors preparing their own manuscripts.
Whatever the reason, people in computing have suddenly found themselves using the same equipment and fighting the same problems as those in printing and publishing. It would be nice to say that we are all working happily together, but there are still plenty of disputes (which is healthy) and plenty of indifference (which is not). There is no doubt, however, that this coming together of different disciplines has brought new life and enthusiasm with it.
The international conference on Text Processing and Document Manipulation at Nottingham is not the first conference to focus on this field of computing. It follows in the footsteps of Research and Trends in Document Preparation Systems at Lausanne in 1981, the Symposium on Text Manipulation at Portland in 1981, La Manipulation de Documents at Rennes in 1983, and the recent PROTEXT conferences in Dublin. We hope, however, that it marks the beginning of a regular series of international conferences that will bring top researchers and practitioners together to exchange ideas and share their enthusiasm with a wide audience.
As the papers for this conference started to come in, a number of themes began to emerge. The dominant theme (in number of papers) was document structures for interactive editing.
Computer text processing is still in the assembly-language era, to use an analogy to program development. The low-level tools available have sufficient power, but control is lacking. The result is that documents produced with computer assistance are often of lower quality than those produced by hand: they look beautiful, but the content and organization suffer. Two promising ideas for correcting this situation are explored: (1) adapting methods of modern, high-level program development (stepwise refinement and iterative enhancement) to document preparation; (2) using a writing environment controlled by a rule-based editor, in which structure is enforced and mistakes more difficult to make.
Wonderful Appearance–Wretched Content
With the advent of relatively inexpensive laser printers, computer output is being routinely typeset. We can expect a revolution in the way business and technical documents are created, based on the use of low-cost typesetters. Easy typesetting and graphics are an extension of word-processing capability, which is already widespread. The essential feature of word processing is its ability to quickly reproduce a changed document with mechanical perfection. However, as the appearance improves, the quality of writing seems to fall in proportion. Two forces are probably at work: (1) More people can (attempt to) write using better technology, and because writing is hard, novices often produce poor work. (2) With improved technology, projects are attempted that were previously avoided; now they are done, badly. These factors are familiar from programming, and they suggest an analogy between creating a document and developing a program. The current word-processing situation corresponds to the undisciplined use of programming languages that preceded so-called “modern programming practices.”
This paper describes both the use and the implementation of W, an interactive text formatter. In W, a document is interactively defined as a hierarchy of nested components. Such a hierarchy may be system- or user-defined. The hierarchy is used both by the W full-screen editor and by the W formatting process, absolving the user from providing any layout commands as such. W manipulates text and such non-text items as mathematical formulae, and has provision for the inclusion of general graphical items.
Introduction
W is an interactive text-editor and document preparation facility being developed within the department of Computer Science at Manitoba. A working prototype of W, known as W-p, has been described elsewhere [King84]. W is a considerable development of that earlier system, but retains the same basic philosophy:
W is an interactive, extensible, integrated editor and formatter;
W adheres as closely as possible to the “what you see is what you get” (wysiwyg) philosophy;
W encompasses a wide range of document items, including text, tables, mathematical formulae, and provision for general graphical items;
W is portable and adaptable; that is, several versions of W are being produced to run on different architectures; although the user interface will differ in its detail, the underlying system will be common;
W is user extensible in a variety of ways.
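The following is a speculative Python sketch of how a hierarchy of nested components might drive formatting without explicit layout commands, in the spirit described above; the component names and layout rules are invented for the illustration and are not taken from W.

```python
# Each component class carries a (made-up) layout rule; the formatter
# derives all layout by walking the component tree, so the user never
# supplies layout commands directly.
LAYOUT = {
    "chapter":   {"before": 2, "indent": 0},
    "paragraph": {"before": 1, "indent": 4},
    "formula":   {"before": 1, "indent": 8},
}

def format_component(kind, content, children=()):
    """Render a component and its nested children as lines of text."""
    lines = [""] * LAYOUT[kind]["before"]
    if content:
        lines.append(" " * LAYOUT[kind]["indent"] + content)
    for child in children:
        lines.extend(format_component(*child))
    return lines

doc = ("chapter", "1. Introduction",
       [("paragraph", "W is an interactive formatter."),
        ("formula", "E = mc^2")])
print("\n".join(format_component(*doc)))
```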
The remainder of this paper is organised as follows. Section 2 describes W from the user's viewpoint and gives some details of its implementation. For the most part, it is a review of material which is covered in greater depth in [King84].
The paper presents a new system for automatic matching of bibliographic data corresponding to items of full textual electronic documents. The problem can otherwise be expressed as the identification of similar or duplicate items existing in different bibliographic databases. A primary objective is the design of an interactive system where matches and near misses are displayed on the user's terminal so that he can confirm or reject the identification before the associated full electronic versions are located and processed further.
Introduction
There is no doubt that ‘electronic publishing’ and other computer based tools for the production and dissemination of printed material open up new horizons for efficient communication. The problems currently faced by the designers of such systems are enormous. One problem area is the identification of duplicate material especially when there is more than one source generating similar documents. Abstracting is a good example here. Another problem area is the linkage between full text and bibliographic databases.
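As a rough illustration of what such duplicate identification involves, here is a hedged Python sketch; the normalization, weighting, and threshold are assumptions made for the example, not the algorithm of the system presented in this paper, and the author name is fictitious.

```python
import difflib

def normalize(record):
    """Reduce a bibliographic record to a comparable key: lower-cased
    title plus the first author's surname."""
    return (record["title"].lower().strip(), record["authors"][0].lower())

def match_score(a, b):
    """Crude similarity between two records, in [0, 1]."""
    title_a, author_a = normalize(a)
    title_b, author_b = normalize(b)
    title_sim = difflib.SequenceMatcher(None, title_a, title_b).ratio()
    return 0.8 * title_sim + 0.2 * (1.0 if author_a == author_b else 0.0)

a = {"title": "Text Processing and Document Manipulation", "authors": ["Smith, J."]}
b = {"title": "Text processing & document manipulation",   "authors": ["Smith, J."]}

score = match_score(a, b)
# Candidate pairs above a threshold would be displayed to the user for
# confirmation or rejection, as described above, before the associated
# full electronic versions are located.
print(round(score, 2), "match" if score > 0.85 else "near miss")
```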
As part of its attempt to establish collaboration between different countries, the European Economic Community initiated the DOCDEL programme of research and a series of studies, such as DOCOLSYS, which investigate the present situation in Europe regarding document identification, location and ordering, with particular reference to electronic ordering and delivery of documents.
The majority of DOCDEL systems likely to be developed fall under one of the following areas: