Lexical semantic competence is a multifaceted and complex phenomenon, which includes the ability to draw inferences, distinguish different word senses, refer to entities in the world, and so on. A long-standing tradition of research in linguistics and cognitive science has investigated these issues using symbolic representations. The aim of this chapter is to understand how and to what extent the major aspects of lexical meaning can be addressed with distributional representations. We have selected a group of research topics that have received particular attention in distributional semantics: (i) identifying and representing multiple meanings of lexical items, (ii) discriminating between different paradigmatic semantic relations, (iii) establishing cross-lingual links among lexemes, (iv) analyzing connotative aspects of meaning, (v) studying semantic change, (vi) grounding distributional representations in extralinguistic experiential data, and (vii) using distributional vectors in cognitive science to model the mental lexicon and semantic memory.
This chapter presents current research in compositional distributional semantics, which aims at designing methods to construct the interpretation of complex linguistic expressions from the distributional representations of the lexical items they contain. This theme includes two major questions that we are going to explore: What is the distributional representation of a phrase or sentence, and to what extent is it able to encode key aspects of its meaning? How can we build such representations compositionally? After introducing the classical symbolic paradigm of compositionality based on function-argument structures and function application, we review different methods to create phrase and sentence vectors (simple vector operations, neural networks trained to learn sentence embeddings, etc.). Then, we investigate the context-sensitive nature of semantic representations, with a particular focus on the last generation of contextual embeddings, and distributional models of selectional preferences. We end with some general considerations about compositionality, semantic structures, and vector models of meaning.
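As a minimal illustration of the simple vector operations mentioned above, the sketch below composes a two-word phrase additively and multiplicatively. The toy vectors are hypothetical, chosen only so the example runs:

```python
import numpy as np

# Hypothetical 3-dimensional distributional vectors for two words
red = np.array([0.8, 0.1, 0.3])
car = np.array([0.2, 0.9, 0.5])

# Additive model: the phrase vector is the sum of the word vectors
additive = red + car

# Multiplicative model: the element-wise product stresses dimensions
# on which both words have high values
multiplicative = red * car

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Compare each composed phrase vector with the head noun
print(cosine(additive, car), cosine(multiplicative, car))
```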
The distributional representation of a lexical item is typically a vector representing its co-occurrences with linguistic contexts. This chapter introduces the basic notions to construct distributional semantic representations from corpora. We present (i) the major types of linguistic contexts used to characterize the distributional properties of lexical items (e.g., window-based and syntactic collocates and documents), (ii) their representation with co-occurrence matrices, whose rows are labeled with lexemes and columns with contexts, (iii) mathematical methods to weight the importance of contexts (e.g., Pointwise Mutual Information and entropy), (iv) the distinction between high-dimensional explicit vectors and low-dimensional embeddings with latent dimensions, (v) dimensionality reduction methods to generate embeddings from the original co-occurrence matrix (e.g., Singular Value Decomposition), and (vi) vector similarity measures (e.g., cosine similarity).
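The six steps above can be tried end to end on a toy corpus. The sketch below is illustrative only (the corpus, window size, and number of latent dimensions are all assumptions): it builds a window-based co-occurrence matrix, weights it with Positive PMI, derives low-dimensional embeddings with SVD, and compares words by cosine similarity:

```python
import numpy as np

# Toy corpus; window-based co-occurrence with a symmetric window of 1
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# (ii) Co-occurrence matrix: rows are lexemes, columns are contexts
M = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                M[idx[w], idx[sent[j]]] += 1

# (iii) Positive Pointwise Mutual Information weighting
total = M.sum()
p_w = M.sum(axis=1, keepdims=True) / total
p_c = M.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((M / total) / (p_w * p_c))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

# (v) Low-dimensional embeddings via truncated SVD
U, S, _ = np.linalg.svd(ppmi)
k = 2  # number of latent dimensions (an assumption)
emb = U[:, :k] * S[:k]

# (vi) Cosine similarity between word embeddings
def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(emb[idx["cat"]], emb[idx["dog"]]))
```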
This chapter contains a synoptic view of the different types and generations of distributional semantic models (DSMs), including the distinction between static and contextual models. Part II then focuses on static DSMs, since they are still the best-known and most widely studied family of models, and they learn context-independent distributional representations that are useful for several linguistic and cognitive tasks.
Neural machine translation is not neutral. The increased linguistic fluency and naturalness that are the hallmark of neural machine translation sometimes run the risk of trans-creation, which bends the true meaning of the source text to accommodate the conventionalized, preferred use and interpretation of concepts, terms, and expressions in the target language and cultural system. This chapter explores the cultural and linguistic bias of neural machine translation of English educational resources on mental health and well-being, highlighting the urgent need to develop and redesign machine translation systems to produce more neutral and balanced machine translation outputs for global end users, especially people from vulnerable social backgrounds.
Access to healthcare profoundly impacts the health and quality of life of Deaf people. Automatic translation tools are crucial in improving communication between Deaf patients and their healthcare providers. The aim of this chapter is to present the pipeline used to create the Swiss-French Sign Language (LSF-CH) version of BabelDr, a speech-enabled fixed phrase translator that was initially conceived to improve communication in emergency settings between doctors and allophone patients (Bouillon et al., 2021). In order to do so, we start off by explaining how we ported BabelDr to LSF-CH using both human and avatar videos. We first describe the creation of a reference corpus consisting of video translations done by human translators; then we present a second corpus of videos generated with a virtual human. Finally, we report the findings of a questionnaire on Deaf users’ perspectives on the use of signing avatars in the medical context. We show that, although respondents prefer human videos, the use of automatic technologies associated with virtual characters is not without interest to the target audience and can be useful to them in the medical context.
Understanding the nature of meaning and its extensions (with metaphor as one typical kind) has been one core issue in figurative language study since Aristotle’s time. This research takes a computational cognitive perspective to model metaphor based on the assumption that meaning is perceptual, embodied, and encyclopedic. We model word meaning representation for metaphor detection with embodiment information obtained from behavioral experiments. Our work is the first attempt to incorporate sensorimotor knowledge into neural networks for metaphor detection, and it demonstrates superior performance, consistency, and interpretability compared with peer systems on two general datasets. In addition, through a cross-sectional analysis of different feature schemas, our results suggest that metaphor, as a device of cognitive conceptualization, can be ‘learned’ from perceptual and actional information independent of several more explicit levels of linguistic representation. Access to such knowledge allows us to probe further into word meaning mapping tendencies relevant to our conceptualization of, and reaction to, the physical world.
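The abstract does not spell out the architecture, so the following is only a hedged sketch of the general idea: concatenate each word’s embedding with its sensorimotor (embodiment) feature vector before classification. The dimensions are assumptions (11 matches Lancaster-style sensorimotor norms; 300 is a typical word-embedding size):

```python
import torch
import torch.nn as nn

class MetaphorClassifier(nn.Module):
    """Fuse a word embedding with its sensorimotor feature vector,
    then classify the token as literal vs. metaphorical."""
    def __init__(self, emb_dim=300, sensorimotor_dim=11, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim + sensorimotor_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # literal vs. metaphorical
        )

    def forward(self, word_emb, sensorimotor):
        return self.mlp(torch.cat([word_emb, sensorimotor], dim=-1))

# Toy batch of 4 tokens with random stand-in features
model = MetaphorClassifier()
logits = model(torch.randn(4, 300), torch.randn(4, 11))
```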
Large language models (LLMs) have achieved amazing successes. They have done well on standardized tests in medicine and the law. That said, the bar has been raised so high that it could take decades to make good on expectations. To buy time for this long-term research program, the field needs to identify some good short-term applications for smooth-talking machines that are more fluent than trustworthy.
A conversational recommender system (CRS) must seamlessly integrate two modules, recommendation and dialog, to recommend high-quality items to users through multiple rounds of interactive dialog. Items can typically refer to goods, movies, news, etc. Through this form of interactive dialog, users can express their preferences in real time, and the system can fully understand the user’s thoughts and recommend corresponding items. Although mainstream dialog recommendation systems have improved performance to some extent, some key issues remain, such as insufficient consideration of the order of entities in the dialog, the different contributions of items in the dialog history, and the low diversity of generated responses. To address these shortcomings, we propose an improved dialog context model based on time-series features. First, we augment the semantic representation of words and items using two external knowledge graphs and align the semantic spaces using mutual information maximization. Second, we add a retrieval model to the dialog recommendation system to provide auxiliary information for generating replies. We then use a deep temporal network to serialize the dialog content and more accurately learn the feature relationships between users and items for recommendation. In this paper, the dialog recommendation system is divided into two components, and different evaluation metrics are used to assess the performance of the dialog component and the recommendation component. Experimental results on widely used benchmarks show that the proposed method is effective.
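As a hedged sketch of the temporal component only (the authors’ exact architecture is not given in the abstract), one way to serialize dialog content is a recurrent encoder over the time-ordered entity mentions, scoring candidate items against the resulting user representation. All names and dimensions below are illustrative:

```python
import torch
import torch.nn as nn

class TemporalEntityEncoder(nn.Module):
    """Encode the time-ordered entities mentioned in a dialog with a GRU,
    so the order of mentions shapes the user representation."""
    def __init__(self, num_entities, dim=128):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, entity_ids):
        # entity_ids: (batch, seq_len), entities in order of mention
        _, h = self.gru(self.entity_emb(entity_ids))
        user_repr = h[-1]  # final hidden state: (batch, dim)
        # Score every candidate item against the user representation
        return user_repr @ self.entity_emb.weight.T

model = TemporalEntityEncoder(num_entities=1000)
scores = model(torch.randint(0, 1000, (2, 5)))  # toy batch of 2 dialogs
```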
Distributional semantics develops theories and methods to represent the meaning of natural language expressions, with vectors encoding their statistical distribution in linguistic contexts. It is at once a theoretical model to express meaning, a practical methodology to construct semantic representations, a computational framework for acquiring meaning from language data, and a cognitive hypothesis about the role of language usage in shaping meaning. This book aims to build a common understanding of the theoretical and methodological foundations of distributional semantics. Beginning with its historical origins, the text exemplifies how the distributional approach is implemented in distributional semantic models. The main types of computational models, including modern deep learning ones, are described and evaluated, demonstrating how various types of semantic issues are addressed by those models. Open problems and challenges are also analyzed. Students and researchers in natural language processing, artificial intelligence, and cognitive science will appreciate this book.
Digital health translation is an important application of machine translation and multilingual technologies, and there is a growing need for accessibility in digital health translation design for disadvantaged communities. This book addresses that need by highlighting state-of-the-art research on the design and evaluation of assistive translation tools, along with systems to facilitate cross-cultural and cross-lingual communications in health and medical settings. Using case studies as examples, the principles of designing assistive health communication tools are illustrated. These are (1) detectability of errors to boost the confidence of health professionals as users; (2) customizability for health and medical domains; (3) inclusivity of translation modalities to serve people with disabilities; and (4) equality of accessibility standards for localised multilingual websites of health content. This book will appeal to readers from natural language processing, computer science, linguistics, translation studies, public health, media, and communication studies. This title is available as open access on Cambridge Core.
Identifying and annotating toxic online content on social media platforms is an extremely challenging problem. Work that studies toxicity in online content has predominantly focused on comments as independent entities. However, comments on social media are inherently conversational, and therefore, understanding and judging the comments fundamentally requires access to the context in which they are made. We introduce a study and resulting annotated dataset where we devise a number of controlled experiments on the importance of context and other observable confounders – namely gender, age and political orientation – towards the perception of toxicity in online content. Our analysis clearly shows the significance of context and the effect of observable confounders on annotations. Specifically, we observe that the ratio of toxic to non-toxic judgements can be very different for each control group, and that a higher proportion of samples is judged toxic in the presence of contextual information.
GPT-3 is a large-scale natural language model developed by OpenAI that can perform many different tasks, including topic classification. Although researchers claim that it requires only a small number of in-context examples to learn a task, in practice GPT-3 requires these training examples to be either of exceptional quality or of a higher quantity than is easily created by hand. To address this issue, this study teaches GPT-3 to classify whether a question is related to data science by augmenting a small training set with additional examples generated by GPT-3 itself. This study compares two augmented classifiers: the Classification Endpoint with an increased training set size and the Completion Endpoint with an augmented prompt optimized using a genetic algorithm. We find that data augmentation significantly increases the accuracy of both classifiers, and that the embedding-based Classification Endpoint achieves the best accuracy of about 76%, compared to human accuracy of 85%. In this way, giving large language models like GPT-3 the ability to propose their own training examples can improve short text classification performance.
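The hosted endpoints aside, embedding-based classification itself can be sketched generically. Everything below is a hypothetical stand-in, not the actual API: the `embed` stub returns random unit vectors purely so the example is self-contained, where a real system would call an embedding model.

```python
import numpy as np

def embed(texts):
    """Stand-in for an embedding model; returns random unit vectors
    so this sketch is self-contained and runnable."""
    rng = np.random.default_rng(0)
    v = rng.normal(size=(len(texts), 64))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

# Labeled examples: a small seed set plus model-generated augmentations
train_texts = ["how do I fit a regression model?", "best pizza in town?"]
train_labels = np.array([1, 0])  # 1 = data science, 0 = not
train_vecs = embed(train_texts)

def classify(query, k=1):
    """Label a query by its nearest neighbors in embedding space
    (dot product equals cosine similarity on unit vectors)."""
    sims = embed([query]) @ train_vecs.T
    nearest = np.argsort(-sims[0])[:k]
    return int(round(train_labels[nearest].mean()))

print(classify("what is gradient descent?"))
```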
Recent developments in text style transfer have brought this field more attention than ever. There are many challenges associated with transferring the style of input text, such as fluency and content preservation, that need to be addressed. In this research, we present PGST, a novel Persian text style transfer approach in the gender domain, composed of different constituent elements. Building on the significance of part-of-speech tags, our method is the first to successfully transfer the gendered linguistic style of Persian text. We employ a pre-trained word embedding for token replacement, a character-based token classifier for gender exchange, and a beam search algorithm for extracting the most fluent combination. Since different approaches are introduced in our research, we determine a trade-off value for evaluating each model’s success in fooling our gender identification model with transferred text. Our research focuses primarily on Persian, but since no Persian baseline is available, we applied our method to a highly studied gender-tagged English corpus and compared it to state-of-the-art English variants to demonstrate its applicability. Our final approach successfully defeated the English and Persian gender identification models by 45.6% and 39.2%, respectively.
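The beam-search step can be sketched in isolation. The candidate lists and toy scorer below are hypothetical; in a system like PGST the scorer would be a fluency model, such as a language-model log-probability:

```python
def beam_search(candidates_per_token, score, beam_width=3):
    """Choose the most fluent sentence from per-position replacement
    candidates, keeping only the top `beam_width` partial sentences."""
    beams = [[]]
    for candidates in candidates_per_token:
        expanded = [seq + [tok] for seq in beams for tok in candidates]
        beams = sorted(expanded, key=score, reverse=True)[:beam_width]
    return beams[0]

# Toy scorer: prefer shorter tokens (stand-in for a real fluency model)
best = beam_search(
    [["he", "she"], ["went", "goes"], ["home", "house"]],
    score=lambda seq: -sum(len(tok) for tok in seq),
)
print(" ".join(best))
```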
This paper explores how to syntactically parse Ancient Greek texts automatically and maps ways of fruitfully employing the results of such an automated analysis. Special attention is given to documentary papyrus texts, a large diachronic corpus of non-literary Greek, which presents a unique set of challenges to tackle. By making use of the Stanford Graph-Based Neural Dependency Parser, we show that through careful curation of the parsing data and several manipulation strategies, it is possible to achieve a Labeled Attachment Score of about 0.85 for this corpus. We also explain how the data can be converted back to its original (Ancient Greek Dependency Treebanks) format. We describe the results of several tests we have carried out to improve parsing results, with special attention paid to the impact of the annotation format on parser performance. In addition, we offer a detailed qualitative analysis of the remaining errors, including possible ways to solve them. Moreover, the paper gives an overview of the valorisation possibilities of an automatically annotated corpus of Ancient Greek texts in the fields of linguistics, language education and humanities studies in general. The concluding section critically analyses the remaining difficulties and outlines avenues to further improve the parsing quality and the ensuing practical applications.
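For reference, the Labeled Attachment Score quoted above is simply the proportion of tokens assigned both the correct head and the correct dependency label (the Unlabeled Attachment Score drops the label requirement). A minimal sketch with a toy example:

```python
def attachment_scores(gold, pred):
    """Compute Unlabeled and Labeled Attachment Scores over a corpus.
    Each token is a (head_index, dependency_label) pair."""
    total = uas = las = 0
    for gold_sent, pred_sent in zip(gold, pred):
        for (g_head, g_label), (p_head, p_label) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                uas += 1
                if g_label == p_label:
                    las += 1
    return uas / total, las / total

# One 3-token sentence; the parser gets one dependency label wrong
gold = [[(2, "nsubj"), (0, "root"), (2, "obj")]]
pred = [[(2, "nsubj"), (0, "root"), (2, "obl")]]
print(attachment_scores(gold, pred))  # (1.0, 0.666...)
```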