Advances in language engineering may depend on theoretical principles originating from linguistics, since both share a common object of enquiry: natural language structures. We outline an approach to term extraction that rests on theoretical claims about the structure of words. We use the structural properties of compound words to specifically elicit the sets of terms defined by type hierarchies such as hyponymy and meronymy. The theoretical claims revolve around the head-modifier principle, which determines the formation of a major class of compounds. Significantly, it has been suggested that the principle operates in languages other than English. To demonstrate the extensibility of our approach beyond English, we present a case study of term extraction in Chinese, a language whose written form is the vehicle of communication for over 1.3 billion language users, and which therefore has great significance for the development of language engineering technologies.
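To make the mechanism concrete, here is a minimal sketch (not the paper's implementation) of how the head-modifier principle can elicit a hyponym set: noun compounds are grouped under their rightmost element, the head, so compounds sharing a head approximate that head's hyponyms. The function name and example terms are invented for illustration; the same grouping applies to Chinese compounds, where the head is likewise final.

```python
# Group noun compounds under their heads (head-modifier principle):
# in "apple tree", "tree" is the head, so "apple tree" is a kind of tree.
from collections import defaultdict

def group_by_head(compounds):
    """Map each head noun to the compounds it heads (hypernym -> hyponyms)."""
    hierarchy = defaultdict(set)
    for compound in compounds:
        parts = compound.split()
        if len(parts) >= 2:
            head = parts[-1]              # rightmost element is the head
            hierarchy[head].add(compound)
    return hierarchy

terms = ["apple tree", "pine tree", "tree house", "house boat"]
print(group_by_head(terms)["tree"])       # {'apple tree', 'pine tree'}
```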
Named entity recognition identifies and classifies entity names in a text document into some predefined categories. It resolves the “who”, “where” and “how much” problems in information extraction and leads to the resolution of the “what” and “how” problems in further processing. This paper presents a Hidden Markov Model (HMM) and proposes an HMM-based named entity recognizer implemented as the system PowerNE. Through the HMM and an effective constraint relaxation algorithm to deal with the data sparseness problem, PowerNE is able to apply and integrate various internal and external evidences of entity names effectively. Currently, four evidences are included: (1) a simple deterministic internal feature of the words, such as capitalization and digitalization; (2) an internal semantic feature of the important triggers; (3) an internal gazetteer feature, which determines the appearance of the current word string in the provided gazetteer list; and (4) an external macro context feature, which deals with the name alias phenomena. In this way, the named entity recognition problem is resolved effectively. PowerNE has been benchmarked with the Message Understanding Conferences (MUC) data. The evaluation shows that, using the formal training and test data of the MUC-6 and MUC-7 English named entity tasks, it achieves F-measures of 96.6 and 94.1, respectively. Compared with the best reported machine learning system, it achieves a 1.7 higher F-measure with one quarter of the training data on MUC-6, and a 3.6 higher F-measure with one ninth of the training data on MUC-7. In addition, it performs slightly better than the best reported handcrafted rule-based systems on MUC-6 and MUC-7.
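For readers unfamiliar with HMM tagging, the sketch below shows the core decoding step such a recognizer rests on: Viterbi search over BIO-style name tags, with one toy "internal evidence" feature (capitalization) as the emission model. All states, probabilities and the example sentence are invented; PowerNE's actual model integrates the four evidence types described above and is far richer.

```python
# Minimal Viterbi decoding over BIO tags with a capitalization-based
# emission model -- a toy stand-in for an HMM named entity recognizer.
import math

STATES = ["O", "B-PER", "I-PER"]
START = {"O": 0.8, "B-PER": 0.2, "I-PER": 1e-6}
TRANS = {
    "O":     {"O": 0.8, "B-PER": 0.2, "I-PER": 1e-6},
    "B-PER": {"O": 0.4, "B-PER": 1e-6, "I-PER": 0.6},
    "I-PER": {"O": 0.5, "B-PER": 1e-6, "I-PER": 0.5},
}

def emit(state, word):
    """Toy emission: capitalized words are likelier inside a person name."""
    capitalized = word[:1].isupper()
    if state == "O":
        return 0.1 if capitalized else 0.9
    return 0.9 if capitalized else 0.1

def viterbi(words):
    chart = [{s: math.log(START[s]) + math.log(emit(s, words[0])) for s in STATES}]
    back = []
    for word in words[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: chart[-1][p] + math.log(TRANS[p][s]))
            row[s] = chart[-1][prev] + math.log(TRANS[prev][s]) + math.log(emit(s, word))
            ptr[s] = prev
        chart.append(row)
        back.append(ptr)
    tags = [max(STATES, key=lambda s: chart[-1][s])]
    for ptr in reversed(back):
        tags.append(ptr[tags[-1]])
    return list(reversed(tags))

print(viterbi("John Smith arrived yesterday".split()))
# ['B-PER', 'I-PER', 'O', 'O']
```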
With growing interest in Chinese Language Processing, numerous NLP tools (e.g., word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore, comparisons are difficult. As a first step towards addressing this issue, we have been preparing a large bracketed corpus since late 1998. The first two installments of the corpus, 250 thousand words of data, fully segmented, POS-tagged and syntactically bracketed, have been released to the public via LDC (www.ldc.upenn.edu). In this paper, we discuss several Chinese linguistic issues and their implications for our treebanking efforts and how we address these issues when developing our annotation guidelines. We also describe our engineering strategies to improve speed while ensuring annotation quality.
Multimodal interfaces are systems that allow input and/or output to be conveyed over multiple channels such as speech, graphics, and gesture. In addition to parsing and understanding separate utterances from different modes such as speech or gesture, multimodal interfaces also need to parse and understand composite multimodal utterances that are distributed over multiple input modes. We present an approach in which multimodal parsing and understanding are achieved using a weighted finite-state device which takes speech and gesture streams as inputs and outputs their joint interpretation. In comparison to previous approaches, this approach is significantly more efficient and provides a more general probabilistic framework for multimodal ambiguity resolution. The approach also enables tight coupling of multimodal understanding with speech recognition. Since the finite-state approach is computationally lightweight, it can be more readily deployed on a broader range of mobile platforms. We provide speech recognition results that demonstrate compensation effects of exploiting gesture information in a directory assistance and messaging task using a multimodal interface.
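The sketch below conveys the finite-state intuition without a full FST library: speech and gesture arrive as weighted hypothesis lists (weights read as costs, lower is better), and a hand-written relation plays the transducer's role of mapping aligned speech/gesture pairs to a joint meaning. All tokens, costs and meaning labels are invented for illustration.

```python
# Toy composition of weighted speech and gesture hypotheses into a joint
# interpretation; gesture context compensates for a weak ASR hypothesis.
speech = [("phone", 0.2), ("phones", 0.9)]              # competing ASR outputs
gesture = [("point:person_entry", 0.5), ("point:dept_listing", 0.3)]

# Transducer-like relation: which speech/gesture pairs compose, and to what.
RELATION = {
    ("phone", "point:person_entry"): "GET_PHONE_NUMBER(person)",
    ("phones", "point:dept_listing"): "LIST_PHONE_NUMBERS(dept)",
}

def joint_interpretation(speech, gesture):
    """Compose the two weighted streams; return the cheapest joint reading."""
    best = None
    for s_tok, s_cost in speech:
        for g_tok, g_cost in gesture:
            meaning = RELATION.get((s_tok, g_tok))
            if meaning and (best is None or s_cost + g_cost < best[0]):
                best = (s_cost + g_cost, meaning)
    return best

print(joint_interpretation(speech, gesture))
# (0.7, 'GET_PHONE_NUMBER(person)')
```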
An architecture for robust parsing of natural language utterances has been developed based on constraint optimization techniques. The resulting system is able to combine possibly contradicting evidence from a variety of information sources, using a plausibility-based arbitration procedure to derive fairly rich structural representations, comprising aspects of syntax, semantics and other description levels of language. The results of a series of experiments are reported which demonstrate the high potential for robust behaviour with respect to ungrammaticality, incomplete utterances, and temporal pressure.
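As a rough illustration of plausibility-based arbitration with soft constraints (not the authors' system), the sketch below scores candidate analyses by multiplying in a penalty for every violated constraint; an ungrammatical utterance still receives the structure that breaks only cheap constraints. Constraint names, scores and candidates are all invented.

```python
# Soft-constraint arbitration: each constraint carries a score in (0, 1];
# near 1 means cheap to violate, near 0 means nearly hard.
CONSTRAINTS = [
    ("subject-verb agreement", 0.8, lambda a: a["agreement_ok"]),
    ("verb has a subject",     0.1, lambda a: a["has_subject"]),
]

def plausibility(analysis):
    """Multiply in each violated constraint's score; higher is more plausible."""
    score = 1.0
    for _name, penalty, holds in CONSTRAINTS:
        if not holds(analysis):
            score *= penalty              # soft violation, not rejection
    return score

candidates = [
    {"label": "reading A", "agreement_ok": False, "has_subject": True},
    {"label": "reading B", "agreement_ok": True,  "has_subject": False},
]
print(max(candidates, key=plausibility)["label"])
# reading A: agreement is the cheaper constraint to break
```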
This paper describes a novel approach to multi-document summarization, which explicitly addresses the problem of detecting, and retaining for the summary, multiple themes in document collections. We place equal emphasis on the processes of theme identification and theme presentation. For the former, we apply Iterative Residual Rescaling (IRR); for the latter, we argue for graphical display elements. IRR is an algorithm designed to account for correlations between words and to construct a multi-dimensional topical space indicative of relationships among linguistic objects (documents, phrases, and sentences). Summaries are composed of objects with certain properties, derived by exploiting the many-to-many relationships in such a space. Given their inherent complexity, our multi-faceted summaries benefit from a visualization environment. We discuss some essential features of such an environment.
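A compact sketch of the IRR idea (after Ando 2000, on which the paper builds): unlike plain SVD/LSA, each new basis vector is extracted after rescaling the residual rows by a power of their length, which boosts poorly represented minority themes. The matrix, the exponent q and the function name below are illustrative, not the paper's settings.

```python
# Iterative Residual Rescaling: extract k basis vectors, rescaling
# under-modelled rows of the residual before each extraction.
import numpy as np

def irr(X, k, q=2.0):
    """Return k basis vectors for the row space of object-by-term matrix X."""
    R = X.astype(float).copy()                   # residual, one object per row
    basis = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=1)
        scaled = R * (norms ** q)[:, None]       # boost long residual rows
        _, _, vt = np.linalg.svd(scaled, full_matrices=False)
        b = vt[0]                                # leading right singular vector
        basis.append(b)
        R = R - np.outer(R @ b, b)               # project that direction out
    return np.array(basis)

X = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1]])  # toy object-by-term counts
print(irr(X, k=2).round(2))
```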
This paper reports on a number of experiments which are designed to investigate the extent to which current NLP resources are able to syntactically and semantically analyse biomedical text. We address two tasks: (a) parsing a real corpus with a hand-built wide-coverage grammar, producing both syntactic analyses and logical forms and (b) automatically computing the interpretation of compound nouns where the head is a nominalisation (e.g. hospital arrival means an arrival at hospital, while patient arrival means an arrival of a patient). For the former task we demonstrate that flexible and yet constrained pre-processing techniques are crucial to success: these enable us to use part-of-speech tags to overcome inadequate lexical coverage, and to package up complex technical expressions prior to parsing so that they are blocked from creating misleading amounts of syntactic complexity. We argue that the XML-processing paradigm is ideally suited for automatically preparing the corpus for parsing. For the latter task, we compute interpretations of the compounds by exploiting surface cues and meaning paraphrases, which in turn are extracted from the parsed corpus. This provides an empirical setting in which we can compare the utility of a comparatively deep parser vs. a shallow one, exploring the trade-off between resolving attachment ambiguities on the one hand and generating errors in the parses on the other. We demonstrate that a model of the meaning of compound nominalisations is achievable with the aid of current broad-coverage parsers.
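The compound-interpretation step lends itself to a small sketch: to decide between readings of a compound nominalisation, count how often the parsed corpus attests each prepositional paraphrase of the head and modifier, and pick the best-attested relation. The counts and relation labels below are invented; the paper derives its cues from the parsed biomedical corpus.

```python
# Interpret modifier-nominalisation compounds by paraphrase frequency:
# "hospital arrival" -> "arrival at hospital" vs "arrival of hospital".
PARAPHRASE_COUNTS = {                      # hypothetical parsed-corpus counts
    ("arrival", "at", "hospital"): 40,
    ("arrival", "of", "hospital"): 1,
    ("arrival", "of", "patient"): 55,
    ("arrival", "at", "patient"): 0,
}
RELATIONS = {"of": "underlying argument", "at": "location"}

def interpret(modifier, head):
    prep = max(RELATIONS, key=lambda p: PARAPHRASE_COUNTS.get((head, p, modifier), 0))
    return f"{modifier} {head} = {head} {prep} {modifier} [{RELATIONS[prep]}]"

print(interpret("hospital", "arrival"))   # arrival at hospital [location]
print(interpret("patient", "arrival"))    # arrival of patient [underlying argument]
```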
One way to keep in touch with what is happening in the commercial speech and language technology world is to pay occasional visits to the websites of HLT Central (at www.hltcentral.org) and LT World (at www.lt-world.org). Both sites provide links to news stories and press releases from companies and other organizations active in the area. The people who run these sites trawl the web for news stories of relevance, saving you the trouble of doing that yourself.
Spelling errors that happen to result in a real word in the lexicon cannot be detected by a conventional spelling checker. We present a method for detecting and correcting many such errors by identifying tokens that are semantically unrelated to their context and are spelling variations of words that would be related to the context. Relatedness to context is determined by a measure of semantic distance initially proposed by Jiang and Conrath (1997). We tested the method on an artificial corpus of errors; it achieved recall of 23–50% and precision of 18–25%.
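A minimal sketch of the detection idea, using NLTK's WordNet interface and the Jiang-Conrath measure the paper adopts: a token is suspicious when it is semantically unrelated to its context while some spelling variation of it is related. The threshold, context and candidate variant below are invented, and the sketch assumes the NLTK 'wordnet' and 'wordnet_ic' data packages are installed.

```python
# Flag real-word spelling errors: low relatedness to context, but a spelling
# variation of the token would be well related.
from nltk.corpus import wordnet as wn, wordnet_ic

IC = wordnet_ic.ic('ic-brown.dat')        # information content for Jiang-Conrath

def relatedness(word, context_words):
    """Best Jiang-Conrath similarity between noun senses of word and context."""
    best = 0.0
    for s1 in wn.synsets(word, pos=wn.NOUN):
        for ctx in context_words:
            for s2 in wn.synsets(ctx, pos=wn.NOUN):
                try:
                    best = max(best, s1.jcn_similarity(s2, IC))
                except Exception:
                    pass                  # measure undefined in some corner cases
    return best

context = ["doctor", "nurse", "surgery"]
token, variant = "hat", "heart"           # 'hat' as a typo for 'heart'
if relatedness(token, context) < 0.1 < relatedness(variant, context):
    print(f"flag '{token}': possibly a misspelling of '{variant}'")
```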
This chapter gives an estimate of the research value of word-for-word translation into a pidgin language, rather than into the full normal form of an output language.
Introduction
The basic problem in machine translation is that of multiple meaning, or polysemy. There are two lines of research that highlight this problem, in that both set a low value on the information-carrying capacity of grammar and syntax, and a high one on the resolution of semantic ambiguity. These are:
matching the main content-bearing words and phrases with a semantic thesaurus that determines their meanings in context;
word-for-word matching translation into a pidgin language using a very large bilingual word-and-phrase dictionary.
This chapter examines the second.
The phrase ‘Mechanical Pidgin’ was first used by R. H. Richens to describe the output given at the beginning of Section 2 of this chapter (below), which, he said, was not English at all but a special language, with the vocabulary of English and a structure reminiscent of Chinese. Machine translation output is always a pidgin, whose characteristics per se are never investigated. Either the samples of this pidgin are post-edited into fuller English, or the nature of the output is explained away as low-level machine translation, or rough machine translation, or some vague remark is made to the effect that pidgin machine translation is all right for most purposes.
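For a present-day reader, the procedure under discussion can be sketched in a few lines: greedy longest-match lookup in a bilingual word-and-phrase dictionary, with no grammatical transfer, yields exactly the kind of pidgin output Richens described. The toy French-English entries below are invented, not drawn from the chapter's dictionary.

```python
# Word-for-word matching translation into "mechanical pidgin":
# longest dictionary phrase wins; untranslated words are left marked.
DICTIONARY = {
    ("il", "y", "a"): ["there-is"],
    ("machine",): ["machine"],
    ("à",): ["to/at"],
    ("traduire",): ["translate"],
}
MAX_PHRASE = max(len(key) for key in DICTIONARY)

def pidgin_translate(tokens):
    out, i = [], 0
    while i < len(tokens):
        for span in range(min(MAX_PHRASE, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + span])
            if key in DICTIONARY:
                out.extend(DICTIONARY[key])
                i += span
                break
        else:
            out.append(tokens[i].upper())   # no entry: pass through, marked
            i += 1
    return " ".join(out)

print(pidgin_translate("il y a machine à traduire".split()))
# there-is machine to/at translate
```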
To the question ‘What is a word?’ philosophers usually give, in succession (as the discussion proceeds), three replies:
‘Everybody knows what a word is.’
‘Nobody knows what a word is.’
‘From the point of view of logic and philosophy, it doesn't matter anyway what a word is, since the statement is what matters, not the word.’
In this paper I shall discuss these three reactions in turn, and dispute the last. Since it is part of my argument that the ways of thinking of several different disciplines must be correlated if we are to progress in our thinking as to what a word is, I shall try to exemplify as many differing contentions as possible by the use of the word ‘word’, since this word is a word which can be used in all senses of ‘word’, which many words cannot.
Two preliminary points about terminology need to be made clear. I am using the word ‘word’ here in the type sense as used by logicians, rather than in the token sense, as synonymous with ‘record of single occurrence of pattern of sound-waves issuing from the mouth’. Thus, when I write here ‘mouth’, ‘mouth’, ‘mouth’, I write only one word.
The second point is that I use in this paper, in different senses, the terms ‘Use’, ‘usage’ and ‘use’. The question as to how the words ‘usage’ and ‘use’ should be used is, as philosophers know, a very thorny one.
1. Current relativist conceptions of science depend widely, though vaguely, upon the insights of T. S. Kuhn (1962), and, in particular, upon his notion of a paradigm. This notion is being used by relativists to support the contention that, since scientific theory is paradigm-founded, and therefore context-based, there can be no one discernible process of scientific verification. However, as I have shown in an earlier paper (1970a), there is another, more exact conception of a Kuhnian paradigm to be considered: namely, that conception of it which says that it is either an analogically used artefact, or even sometimes an actual ‘crude analogy’, that is, an analogical figure of speech expressed in a string of words.
This alternative conception of paradigm, far from supporting a verification-deprived conception of science (which, for those of us philosophers who are also trying to do technological science, just seems a conception of science totally divorced from scientific reality) can, on the contrary, be used to enrich and amplify the most strictly verification-based philosophy of science that is known, namely the Braithwaitean conception of it as a verifiable hypothetico-deductive (H-D) system. For such a paradigm, even though, in unselfconscious scientific thinking, it is usually a crude and concrete conceptual structure, can yet be shown to yield a set of abstract attributes.
The purpose of this chapter is to present a philosophical model of real translation. ‘Translation’ is here used in its ordinary sense: in the sense, that is, in which we say that passages of Burke can be translated into Ciceronian Latin prose, or that the sentence ‘He shot the wrong woman’ is untranslatable into good French. The term ‘philosophical’, however, needs some explaining, since, so far as I know, no one has made a philosophical model of translation as yet. I shall call a model of translation ‘philosophical’ if it has the following characteristics:
It must not only throw some light on the problem of transformation within a language, but must deal also with the problem of reference to something. That is to say, it must relate the strings of language units in the various languages with which it deals to public and recognisable situations in everyday life. It is characteristic of philosophers that, unlike most linguists, they do not regard a text in language as self-contained.
It must deal in concepts, not only in words or terms. All philosophers believe in concepts, though they sometimes pretend not to.
It must face, and not evade, the problem of constructing a universal grammar, while yet recognising fully how greatly languages differ, and how peripheral is the whole problem of determining the nature of language.
The study of language, like the study of mathematical systems, has always been thought to be relevant to the study of forms of argument in science. Language as the scientist uses it, however, is assumed to be potentially interlingual, conceptual and classificatory. This fact makes current philosophical methods of studying language irrelevant to the philosophy of science.
An alternative method of analysing language is proposed. This is that we should take as a model for language the classification system of a great library. Such a classification system is described.
Classification systems of this kind, however, tend to break down because of the phenomena of profusion of meaning, extension of meaning and overlap of meaning in actual languages. The librarian finds that empirically based semantic aggregates (overlapping clusters of meanings) are forming within the system. These are defined as concepts. By taking these aggregates as units, the system can still be used to classify.
An outline sketch is given of a mathematical model of language, language being taken as a totality of semantic aggregates. Language, thus considered, forms a finite lattice. A procedure for retrieving information within the system is described.
The scientific procedures of phrase-coining, classifying and analogy-finding are described in terms of the model.
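Though the chapter is philosophical, its model is concrete enough for a toy rendering: semantic aggregates as overlapping sets of meaning elements, lattice meet and join as intersection and union, and retrieval as locating the aggregates a meaning element falls in. The aggregates below are invented examples, not taken from the chapter's library system.

```python
# Language as a finite lattice of overlapping semantic aggregates.
AGGREGATES = {
    "PLANT(organism)": frozenset({"grow", "green", "living"}),
    "PLANT(factory)":  frozenset({"building", "produce", "machine"}),
    "plant(word)":     frozenset({"grow", "green", "building", "produce"}),
}

def meet(a, b):
    return a & b      # greatest common part of two aggregates

def join(a, b):
    return a | b      # least aggregate covering both

def retrieve(element):
    """Find every aggregate in which a meaning element participates."""
    return [name for name, agg in AGGREGATES.items() if element in agg]

print(retrieve("grow"))          # ['PLANT(organism)', 'plant(word)']
print(sorted(meet(AGGREGATES["PLANT(organism)"], AGGREGATES["plant(word)"])))
# ['green', 'grow']
```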
The point of relevance of the study of language to the philosophy of science
Two very general disciplines have always been thought especially relevant to our understanding of the nature of science.