We describe SkillSum, a Natural Language Generation (NLG) system that generates a personalised feedback report for someone who has just completed a screening assessment of their basic literacy and numeracy skills. Because many SkillSum users have limited literacy, the generated reports must be easily comprehended by people with limited reading skills; this is the most novel aspect of SkillSum, and the focus of this paper. We used two approaches to maximise readability. First, for determining content and structure (document planning), we did not explicitly model readability, but rather followed a pragmatic approach of repeatedly revising content and structure following pilot experiments and interviews with domain experts. Second, for choosing linguistic expressions (microplanning), we attempted to formulate explicitly the choices that enhanced readability, using a constraints approach and preference rules; our constraints were based on corpus analysis and our preference rules were based on psycholinguistic findings. Evaluation of the SkillSum system was twofold: it compared the usefulness of NLG technology to that of canned text output, and it assessed the effectiveness of the readability model. Results showed that NLG was more effective than canned text at enhancing users' knowledge of their skills, and also suggested that the empirical ‘revise based on experiments and interviews’ approach contributed substantially to readability, as did our explicit psycholinguistically inspired models of readability choices.
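As a rough illustration of the constraint-plus-preference microplanning scheme described above, the sketch below filters candidate phrasings by hard constraints and ranks the survivors with preference rules. The candidate sentences, the constraint threshold and the rule weights are invented for illustration; they are not SkillSum's actual rules.

```python
# Toy constraint-plus-preference microplanning sketch.
# Candidates, constraints and weights are invented, not SkillSum's own.

CANDIDATES = [
    {"text": "You got 14 questions right.", "words": 5, "common_words": True, "passive": False},
    {"text": "14 questions were answered correctly by you.", "words": 7, "common_words": True, "passive": True},
    {"text": "Your performance was satisfactory.", "words": 4, "common_words": False, "passive": False},
]

def satisfies_constraints(cand):
    """Hard constraint (e.g. derived from corpus analysis): keep sentences short."""
    return cand["words"] <= 12

def preference_score(cand):
    """Preference rules (e.g. psycholinguistically motivated): favour common words, avoid passives."""
    score = 0
    score += 2 if cand["common_words"] else 0
    score += 1 if not cand["passive"] else 0
    return score

def choose_phrasing(candidates):
    viable = [c for c in candidates if satisfies_constraints(c)]
    return max(viable, key=preference_score)["text"]

if __name__ == "__main__":
    print(choose_phrasing(CANDIDATES))  # -> "You got 14 questions right."
```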
The limits on predictability and refinement of English structural annotation are examined by comparing independent annotations, by experienced analysts using the same detailed published guidelines, of a common sample of written texts. Three conclusions emerge. First, while it is not easy to define watertight boundaries between the categories of a comprehensive structural annotation scheme, limits on inter-annotator agreement are in practice set more by the difficulty of conforming to a well-defined scheme than by the difficulty of making a scheme well defined. Secondly, although usage is often structurally ambiguous, commonly the alternative analyses are logical distinctions without a practical difference – which raises questions about the role of grammar in human linguistic behaviour. Finally, one specific area of annotation is strikingly more problematic than any other area examined, though this area (classifying the functions of clause-constituents) seems a particularly significant one for human language use. These findings should be of interest both to computational linguists and to students of language as an aspect of human cognition.
Finite-state technology is considered the preferred model for representing the phonology and morphology of natural languages. The attractiveness of this technology for natural language processing stems from four sources: modularity of the design, due to the closure properties of regular languages and relations; the compact representation that is achieved through minimization; efficiency, which is a result of linear recognition time with finite-state devices; and reversibility, resulting from the declarative nature of such devices. However, when wide-coverage morphological grammars are considered, finite-state technology does not scale up well, and the benefits of this technology can be overshadowed by the limitations it imposes as a programming environment for language processing. This paper investigates the strengths and weaknesses of existing technology, focusing on various aspects of large-scale grammar development. Using a real-world case study, we compare a finite-state implementation with an equivalent Java program with respect to ease of development, modularity, maintainability of the code, and space and time efficiency. We identify two main problems, abstraction and incremental development, which are currently not addressed sufficiently well by finite-state technology, and which we believe should be the focus of future research and development.
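As a toy illustration of the linear-time recognition property mentioned above (not the paper's case study, which compares a full finite-state grammar with a Java program), the sketch below walks a hand-built finite-state automaton over a word, so recognition cost grows only with word length. The stems, suffixes and feature labels are invented.

```python
# Minimal finite-state sketch: a hand-built automaton accepting a few
# noun stems plus an optional plural "s". Toy data, not the paper's grammar.

def build_automaton(stems, suffixes):
    """Build a trie-shaped automaton; accepting states carry a feature label."""
    transitions = {0: {}}
    accepting = {}
    next_state = 1

    def add(path, label):
        nonlocal next_state
        state = 0
        for ch in path:
            if ch not in transitions[state]:
                transitions[state][ch] = next_state
                transitions[next_state] = {}
                next_state += 1
            state = transitions[state][ch]
        accepting[state] = label

    for stem in stems:
        for suffix, label in suffixes:
            add(stem + suffix, label)
    return transitions, accepting

def recognise(word, transitions, accepting):
    """Linear-time recognition: exactly one transition per input character."""
    state = 0
    for ch in word:
        state = transitions[state].get(ch)
        if state is None:
            return None
    return accepting.get(state)

if __name__ == "__main__":
    trans, acc = build_automaton(
        stems=["cat", "dog"],
        suffixes=[("", "N+sg"), ("s", "N+pl")],
    )
    print(recognise("dogs", trans, acc))  # -> "N+pl"
    print(recognise("dig", trans, acc))   # -> None
```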
The semantic annotation of texts with senses from a computational lexicon is a complex and often subjective task. As a matter of fact, the fine granularity of the WordNet sense inventory [Fellbaum, Christiane (ed.). 1998. WordNet: An Electronic Lexical Database. MIT Press], a de facto standard within the research community, is one of the main causes of a low inter-tagger agreement ranging between 70% and 80% and the disappointing performance of automated fine-grained disambiguation systems (around 65% state of the art in the Senseval-3 English all-words task). In order to improve the performance of both manual and automated sense taggers, we must either change the sense inventory (e.g. by adopting a new dictionary or clustering WordNet senses) or resolve the disagreements between annotators by dealing with the fineness of sense distinctions. The former approach is not viable in the short term, as wide-coverage resources are not publicly available and no large-scale reliable clustering of WordNet senses has been released to date. The latter approach requires the ability to distinguish between subtle or misleading sense distinctions. In this paper, we propose the use of structural semantic interconnections – a specific kind of lexical chain – for the adjudication of disputed sense assignments to words in context. The approach relies on the exploitation of the lexicon structure as a support to smooth possible divergences between sense annotators and foster coherent choices. We perform a twofold experimental evaluation of the approach applied to manual annotations from the SemCor corpus, and automatic annotations from the Senseval-3 English all-words competition. Both sets of experiments and results are entirely novel: structural adjudication allows us to improve the state-of-the-art performance in all-words disambiguation by 3.3 points (achieving a 68.5% F1-score) and attains figures around 80% precision and 60% recall in the adjudication of disagreements from human annotators.
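A rough approximation of the adjudication idea can be sketched with NLTK's WordNet interface: when annotators disagree, prefer the disputed sense that is most strongly connected to the senses of surrounding content words. WordNet path similarity is used here as a crude stand-in for the paper's structural semantic interconnections, and the example words are invented.

```python
# Crude stand-in for structural adjudication: choose the disputed sense that is
# best connected (here via WordNet path similarity) to the context senses.
# Requires: pip install nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def connectivity(sense, context_words):
    """Sum, over context words, the best path similarity to any of their senses."""
    total = 0.0
    for word in context_words:
        sims = [sense.path_similarity(s) or 0.0 for s in wn.synsets(word)]
        if sims:
            total += max(sims)
    return total

def adjudicate(disputed_senses, context_words):
    """Return whichever disputed sense is better connected to the context."""
    return max(disputed_senses, key=lambda s: connectivity(s, context_words))

if __name__ == "__main__":
    # Two annotators disagree on "bank" in a sentence about rivers (invented example).
    senses = [wn.synset("bank.n.01"),   # sloping land beside water
              wn.synset("bank.n.02")]   # financial institution
    context = ["river", "water", "shore"]
    print(adjudicate(senses, context))  # expected: Synset('bank.n.01')
```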
Two important recent trends in natural language generation are (i) probabilistic techniques and (ii) comprehensive approaches that move away from traditional strictly modular and sequential models. This paper reports experiments in which pCRU – a generation framework that combines probabilistic generation methodology with a comprehensive model of the generation space – was used to semi-automatically create five different versions of a weather forecast generator. The generators were evaluated in terms of output quality, development time and computational efficiency against (i) human forecasters, (ii) a traditional handcrafted pipelined NLG system and (iii) a HALogen-style statistical generator. The most striking result is that despite acquiring all decision-making abilities automatically, the best pCRU generators produce outputs of high enough quality to be scored more highly by human judges than forecasts written by experts.
Feature unification in parsing has previously used either inefficient Prolog programs, or LISP programs implementing early pre-WAM Prolog models of unification involving searches of binding lists, and the copying of rules to generate edges: features within rules and edges have traditionally been expressed as lists or functions, with clarity being preferred to speed of processing. As a result, parsing takes about 0·5 seconds for a 7-word sentence. Our earlier work produced an optimised chart parser for a non-unification context-free-grammar that achieved 5 ms parses, with high-ambiguity sentences involving hundreds of edges, using the grammar and sentences from Tomita's work on shift-reduce parsing with multiple stack branches. A parallel logic card design resulted that would speed this by a further factor of at least 17. The current paper extends this parser to treat a much more complex unification grammar with structures, using extensive indexing of rules and edges and the optimisations of top-down filtering and look-ahead, to demonstrate where unification occurs during parsing. Unification in parsing is distinguished from that in Prolog, and four alternative schemes for storing features and performing unification are considered, including the traditional binding-list method and three other methods optimised for speed for which overall unification times are calculated. Parallelisation of unification using cheap logic hardware is considered, and estimates show that unification will negligibly increase the parse time of our parallel parser card. Preliminary results are reported from a prototype serial parser that uses the fourth most efficient unification method, and achieves 7 ms for 7-word sentences, and under 1 s for a 36-word 360-way ambiguous sentence with 10,000 edges, on a conventional workstation.
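The unification step at the heart of such a parser can be sketched, for simple nested feature structures, as recursive merging with failure on atomic clashes. The code below is only an illustration of the operation performed when an edge is combined with a rule constituent; it does not reconstruct any of the four storage schemes the paper compares, and it omits variables and reentrancy.

```python
# Simplified feature-structure unification of the kind performed when an
# active edge is combined with a rule constituent during chart parsing.
# Feature structures are plain nested dicts; variables and reentrancy are omitted.

FAIL = None

def unify(fs1, fs2):
    """Return the most general feature structure subsuming both, or None on clash."""
    result = dict(fs1)
    for feat, val2 in fs2.items():
        if feat not in result:
            result[feat] = val2
            continue
        val1 = result[feat]
        if isinstance(val1, dict) and isinstance(val2, dict):
            sub = unify(val1, val2)
            if sub is FAIL:
                return FAIL
            result[feat] = sub
        elif val1 != val2:
            return FAIL            # atomic value clash, e.g. sg vs pl
    return result

if __name__ == "__main__":
    rule_np  = {"cat": "NP", "agr": {"num": "sg"}}
    edge_ok  = {"cat": "NP", "agr": {"num": "sg", "per": 3}}
    edge_bad = {"cat": "NP", "agr": {"num": "pl"}}
    print(unify(rule_np, edge_ok))   # -> merged structure
    print(unify(rule_np, edge_bad))  # -> None (number clash)
```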
Artificial languages for person-machine communication seldom display the most characteristic properties of natural languages, such as the use of anaphoric or other referring expressions, or ellipsis. This paper argues that good use could be made of such devices in artificial languages, and proposes a mechanism for the resolution of ellipsis and anaphora in them using finite state transduction techniques. This yields an interpretation system with many desirable properties: it is easily implementable, efficient, incremental and reversible.
Linguists in general, and computational linguists in particular, do well to employ finite state devices wherever possible. They are theoretically appealing because they are computationally weak and best understood from a mathematical point of view. They are computationally appealing because they make for simple, elegant, and highly efficient implementations. In this paper, I hope I have shown how they can be applied to a problem… which seems initially to require heavier machinery.
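The flavour of the proposal can be conveyed with a small sketch over an invented slot-based command language (not the paper's language): an elliptical follow-up command is expanded by copying unspecified slots from the previous fully specified one, a transduction simple enough to be realised with finite-state machinery.

```python
# Toy sketch of ellipsis resolution in an invented slot-based command language:
# slots left unspecified in a follow-up command are filled from the previous
# command, in the spirit of a finite-state transduction.

def parse(command):
    """Parse 'key=value' pairs, e.g. 'action=print file=a.txt printer=lp1'."""
    return dict(token.split("=", 1) for token in command.split())

def resolve_ellipsis(previous, elliptical):
    """Copy any slot left unspecified in the elliptical command from the previous one."""
    resolved = dict(previous)
    resolved.update(parse(elliptical))
    return resolved

if __name__ == "__main__":
    first = parse("action=print file=report.txt printer=lp1")
    follow_up = "file=summary.txt"   # elliptical: "now summary.txt"
    print(resolve_ellipsis(first, follow_up))
    # -> {'action': 'print', 'file': 'summary.txt', 'printer': 'lp1'}
```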
Text-to-speech systems are currently designed to work on complete sentences and paragraphs, thereby allowing front end processors access to large amounts of linguistic context. Problems with this design arise when applications require text to be synthesized in near real time, as it is being typed. How does the system decide which incoming words should be collected and synthesized as a group when prior and subsequent word groups are unknown? We describe a rule-based parser that uses a three cell buffer and phrasing rules to identify break points for incoming text. Words up to the break point are synthesized as new text is moved into the buffer; no hierarchical structure is built beyond the lexical level. The parser was developed for use in a system that synthesizes written telecommunications by Deaf and hard of hearing people. These are texts written entirely in upper case, with little or no punctuation, and using a nonstandard variety of English (e.g. WHEN DO I WILL CALL BACK YOU). The parser performed well in a three month field trial utilizing tens of thousands of texts. Laboratory tests indicate that the parser exhibited a low error rate when compared with a human reader.
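A skeletal version of such a buffer-and-rules break-point finder is sketched below. The single phrasing rule used here (break before a conjunction or wh-word, or whenever the three-cell buffer is full) is invented for illustration and is not the rule set of the deployed system.

```python
# Skeletal three-cell buffer for choosing synthesis break points in
# incrementally typed text. The phrasing rule and word list are invented.

BREAK_BEFORE = {"AND", "BUT", "OR", "WHEN", "WHERE", "WHAT", "WHY", "HOW"}
BUFFER_SIZE = 3

def break_points(words):
    """Yield word groups to hand to the synthesiser as text streams in."""
    buffer = []
    for word in words:
        if buffer and (word.upper() in BREAK_BEFORE or len(buffer) == BUFFER_SIZE):
            yield buffer
            buffer = []
        buffer.append(word)
    if buffer:
        yield buffer

if __name__ == "__main__":
    text = "WHEN DO I WILL CALL BACK YOU AND WHAT TIME"
    for group in break_points(text.split()):
        print(" ".join(group))   # prints groups such as "WHEN DO I", "WILL CALL BACK", ...
```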
This paper identifies some linguistic properties of technical terminology, and uses them to formulate an algorithm for identifying technical terms in running text. The grammatical properties discussed are preferred phrase structures: technical terms consist mostly of noun phrases containing adjectives, nouns, and occasionally prepositions; rarely do terms contain verbs, adverbs, or conjunctions. The discourse properties are patterns of repetition that distinguish noun phrases that are technical terms, especially those multi-word phrases that constitute a substantial majority of all technical vocabulary, from other types of noun phrase.
The paper presents a terminology identification algorithm that is motivated by these linguistic properties. An implementation of the algorithm is described; it recovers a high proportion of the technical terms in a text, and a high proportion of the recovered strings are valid technical terms. The algorithm proves to be effective regardless of the domain of the text to which it is applied.
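The grammatical filter described above can be approximated as a regular expression over part-of-speech tag sequences (adjective/noun sequences ending in a noun) combined with a repetition threshold. The tag pattern, threshold and tiny tagged corpus below are illustrative, not the paper's exact implementation.

```python
# Sketch of a terminology identifier: candidate multi-word terms are noun
# phrases whose tag sequence is adjectives/nouns ending in a noun, kept only
# if they repeat in the text. The tagged input and threshold are illustrative.
import re
from collections import Counter

TAG_PATTERN = re.compile(r"[AN]*N")   # adjective/noun tags ending in a noun
MIN_FREQ = 2

def candidate_terms(tagged_sentences, max_len=4):
    """Collect word n-grams (length 2..max_len) whose tag sequence matches the pattern."""
    counts = Counter()
    for sent in tagged_sentences:
        for i in range(len(sent)):
            for j in range(i + 2, min(i + max_len, len(sent)) + 1):
                words, tags = zip(*sent[i:j])
                if TAG_PATTERN.fullmatch("".join(tags)):
                    counts[" ".join(words)] += 1
    return [term for term, freq in counts.items() if freq >= MIN_FREQ]

if __name__ == "__main__":
    # Tiny POS-tagged corpus: A = adjective, N = noun, D = determiner.
    sents = [
        [("central", "A"), ("processing", "N"), ("unit", "N")],
        [("the", "D"), ("central", "A"), ("processing", "N"), ("unit", "N"), ("speed", "N")],
    ]
    print(candidate_terms(sents))
    # -> ['central processing', 'central processing unit', 'processing unit']
```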
This contribution focuses on a dialogue model built around an intelligent working memory that aims to facilitate robust human-machine dialogue in written natural language. The model has been designed as the core of an information-seeking dialogue application. A distinctive feature of this project is its reliance on the interpretation and behaviour capabilities afforded by pragmatic knowledge. Within this framework, the dialogue model acts as a kind of ‘forum’ for various facets, embodied by different models drawn from both intentional and structural approaches to conversation. The approach is based on the assumption that multiple expertise is the key to flexibility and robustness, and that an intelligent memory which keeps track of all events and links them together from as many angles as necessary is crucial for managing that multiple expertise. This idea is developed by presenting an intelligent dialogue history that complements the wide coverage of the co-operating models: it is no longer a simple chronological record, but a communication area common to all processes. We illustrate the approach with examples drawn from collected corpora.
Inside parsing is a best-parse method based on the Inside algorithm, which is often used in estimating the probabilistic parameters of stochastic context-free grammars. It gives a best parse in O(N³G³) time, where N is the input size and G is the grammar size. The Earley algorithm can be made to return best parses with the same complexity in N.
Through experiments, we show that Inside parsing can be more efficient than Earley parsing with a sufficiently large grammar and sufficiently short input sentences. For instance, Inside parsing is better with sentences of 16 or fewer words for a grammar containing 429 states. In practice, parsing can be made efficient by employing the two methods selectively.
The redundancy of the Inside algorithm can be reduced by top-down filtering using the chart produced by the Earley algorithm, which is useful in training the probabilistic parameters of a grammar. Extensive experiments on the Penn Treebank corpus show that the efficiency of the Inside computation can be improved by up to 55%.
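For a grammar in Chomsky normal form, the best-parse computation underlying Inside parsing can be sketched as a Viterbi-style CKY pass, which makes the O(N³G³) behaviour visible: three nested span loops over the input, multiplied by the rule combinations considered at each split. The toy grammar and lexicon below are invented and are not the grammars used in the paper's experiments.

```python
# Viterbi-style CKY sketch of best-parse computation over a CNF PCFG,
# the dynamic programme underlying Inside-based best parsing.
# The grammar and lexicon are invented toy data.
from collections import defaultdict

# Binary rules: (B, C) -> list of (A, prob) meaning A -> B C with probability prob
BINARY = {
    ("NP", "VP"): [("S", 1.0)],
    ("Det", "N"): [("NP", 0.7)],
}
# Lexical rules: word -> list of (A, prob)
LEXICON = {
    "the": [("Det", 1.0)],
    "dog": [("N", 0.6), ("NP", 0.3)],
    "barks": [("VP", 1.0)],
}

def best_parse_prob(words, start="S"):
    n = len(words)
    chart = defaultdict(dict)            # chart[(i, j)][A] = best probability for A over words[i:j]
    for i, w in enumerate(words):
        for cat, p in LEXICON.get(w, []):
            chart[(i, i + 1)][cat] = max(chart[(i, i + 1)].get(cat, 0.0), p)
    for span in range(2, n + 1):                     # O(N) span lengths
        for i in range(n - span + 1):                # O(N) start positions
            j = i + span
            for k in range(i + 1, j):                # O(N) split points
                for b, pb in chart[(i, k)].items():  # grammar-size factors
                    for c, pc in chart[(k, j)].items():
                        for a, pr in BINARY.get((b, c), []):
                            p = pr * pb * pc
                            if p > chart[(i, j)].get(a, 0.0):
                                chart[(i, j)][a] = p
    return chart[(0, n)].get(start, 0.0)

if __name__ == "__main__":
    print(best_parse_prob("the dog barks".split()))  # -> 0.42 (S -> NP VP with NP -> Det N)
```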
Morphological analysis, which is at the heart of the processing of natural language, requires computationally effective morphological processors. In this paper an approach to the organization of an inflectional morphological model and its application to the Russian language is described. The main objective of our morphological processor is not the classification of word constituents, but rather an efficient computational recognition of morpho-syntactic features of words and the generation of words according to requested morpho-syntactic features. Another major concern that the processor aims to address is the ease of extending the lexicon. The templated word-paradigm model used in the system has an engineering flavour: paradigm formation rules are of a bottom-up (word-specific) nature rather than general observations about the language, and word formation units are segments of words rather than proper morphemes. This approach allows us to handle uniformly both general cases and exceptions, and requires extremely simple data structures and control mechanisms which can be easily implemented as finite-state automata. The morphological processor described in this paper is fully implemented for a substantial subset of Russian (more than 1,500,000 word-tokens – 95,000 word paradigms) and provides an extensive list of morpho-syntactic features together with stress positions for words utilized in its lexicon. Special dictionary management tools were built for browsing, debugging and extension of the lexicon. The actual implementation was done in C and C++, and the system is available for the MS-DOS, MS-Windows and UNIX platforms.
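The word-paradigm organisation can be illustrated with a toy sketch: each lexicon entry points to a paradigm template, a table mapping word-final segments to morpho-syntactic feature bundles, which supports both generation and analysis. The Russian endings and feature labels shown are a tiny invented fragment, not the processor's actual tables.

```python
# Toy word-paradigm sketch: stems point to paradigm templates that map
# endings to morpho-syntactic features. The fragment below is invented
# and covers only a few forms of one Russian noun type.

PARADIGMS = {
    "noun_fem_a": {          # e.g. nouns like "книга" (book)
        "а": "N,fem,sg,nom",
        "и": "N,fem,sg,gen",
        "е": "N,fem,sg,dat",
        "у": "N,fem,sg,acc",
    },
}

LEXICON = {
    "книг": "noun_fem_a",    # stem -> paradigm template
}

def generate(stem, features):
    """Produce the word form carrying the requested feature bundle."""
    paradigm = PARADIGMS[LEXICON[stem]]
    for ending, feats in paradigm.items():
        if feats == features:
            return stem + ending
    return None

def analyse(word):
    """Return (stem, features) pairs consistent with the lexicon."""
    results = []
    for stem, template in LEXICON.items():
        if word.startswith(stem):
            feats = PARADIGMS[template].get(word[len(stem):])
            if feats:
                results.append((stem, feats))
    return results

if __name__ == "__main__":
    print(generate("книг", "N,fem,sg,gen"))  # -> "книги"
    print(analyse("книгу"))                  # -> [("книг", "N,fem,sg,acc")]
```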