Large language models (LLMs) are a powerful tool for conducting text analysis in political science, but using them to annotate text has several drawbacks, including high cost, limited reproducibility, and poor explainability. Traditional supervised text classifiers are fast and reproducible, but they require expensive hand annotation, which is especially difficult for rare classes. This article proposes using LLMs to generate synthetic training data for training smaller, traditional supervised text models. Synthetic data can augment limited hand-annotated data or be used on its own to train a classifier with good performance at greatly reduced cost. I provide a conceptual overview of text generation, guidance on when researchers should prefer different techniques for generating synthetic text, a discussion of ethics, a simple technique for improving the quality of synthetic text, and an illustration of its limitations. I demonstrate the usefulness of synthetic training data through three validations: synthetic news articles describing police responses to communal violence in India for training an event-detection system, a multilingual corpus of synthetic populist manifesto statements for training a sentence-level populism classifier, and synthetic tweets describing the fighting in Ukraine for improving a named entity recognition system.
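As a rough illustration of the workflow this abstract describes, the sketch below prompts an LLM for synthetic labelled sentences and then trains a conventional supervised classifier on them. The prompt wording, the gpt-4o-mini model name, and the two toy classes are assumptions for the demo, not the article's actual setup.

```python
# Illustrative sketch (not the article's code): generate synthetic labelled text
# with an LLM, then train a cheap, reproducible supervised classifier on it.
from openai import OpenAI  # assumes the openai client library and an API key
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["police_response", "no_police_response"]  # hypothetical classes

def generate_synthetic(label: str, n: int = 50) -> list[str]:
    """Ask the LLM for n short synthetic news-style sentences for one class."""
    prompt = (
        f"Write {n} short, varied news-style sentences, one per line, that an "
        f"annotator would label as '{label}' for an event-detection task."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]

texts, labels = [], []
for label in LABELS:
    synthetic = generate_synthetic(label)
    texts += synthetic
    labels += [label] * len(synthetic)

# The downstream classifier is fast, cheap, and fully reproducible to rerun.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
```

In practice, the synthetic examples would be mixed with whatever hand-annotated data exists and the resulting classifier validated against a held-out, human-labelled test set.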
This paper focuses on the epistemic situation one faces when using a chatbot based on a large language model (LLM), such as ChatGPT: when reading the output of the chatbot, how should one decide whether or not to believe it? By surveying the strategies we use with other, more familiar sources of information, I argue that chatbots present a novel challenge. This makes the question of how one could trust a chatbot especially vexing.
We demonstrate how few-shot prompts to large language models (LLMs) can be effectively applied to a wide range of text-as-data tasks in political science—including sentiment analysis, document scaling, and topic modeling. In a series of pre-registered analyses, this approach outperforms conventional supervised learning methods without the need for extensive data pre-processing or large sets of labeled training data. Performance is comparable to expert and crowd-coding methods at a fraction of the cost. We propose a set of best practices for adapting these models to social science measurement tasks, and develop an open-source software package for researchers.
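A minimal sketch of what a few-shot prompt for one such task (sentiment analysis) might look like; the prompt wording, in-context examples, and model name are illustrative assumptions, not the authors' pre-registered prompts or their software package.

```python
# Illustrative few-shot sentiment prompt (assumed wording, not the paper's
# pre-registered prompts or its open-source package).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT = [
    ("The minister's speech was inspiring and hopeful.", "positive"),
    ("The new policy is a disaster for working families.", "negative"),
]

def classify_sentiment(text: str) -> str:
    lines = ["Label each text as 'positive' or 'negative'.", ""]
    for example, label in FEW_SHOT:   # labelled examples given in-context
        lines += [f"Text: {example}", f"Label: {label}", ""]
    lines += [f"Text: {text}", "Label:"]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",          # placeholder model name
        messages=[{"role": "user", "content": "\n".join(lines)}],
        temperature=0,                # keep the measurement as deterministic as possible
    )
    return resp.choices[0].message.content.strip().lower()

print(classify_sentiment("Turnout was disappointing and the mood was grim."))
```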
The Paper explores whether the current ecosystem of existing and still-to-be-adopted rules on artificial intelligence (AI) systems in the European Union fully and adequately addresses liability for damages caused by Generative AI systems. It first maps the distinctive features and functional characteristics of Generative AI that are likely to affect regulatory and legal considerations, in particular the determination of the specific regulatory regime governing Generative AI and the tracing and allocation of liability along the value chain pursuant to the AI Act. On the basis of this mapping exercise, the Paper tests the liability rules provided for in the draft Artificial Intelligence Liability Directive and the Revised Product Liability Directive and assesses their sufficiency and effectiveness in the face of Generative AI. Beyond the assessment of the rules laid down in these texts, the Paper briefly describes other liability scenarios to be explored in future work.
The rapid development of generative artificial intelligence (AI) systems, particularly those fuelled by increasingly advanced large language models, has raised concerns among policymakers globally about their potential risks. In July 2023, Chinese regulators enacted the Interim Measures for the Management of Generative AI Services (“the Measures”). The Measures aim to mitigate various risks associated with public-facing generative AI services, particularly those concerning information content safety and security. China’s approach to regulating AI to date has sought to address the risks associated with rapidly advancing AI technologies while fostering innovation and development. Tensions between these policy objectives are reflected in the provisions of the Measures. As Beijing moves towards establishing a comprehensive legal framework for AI governance, there will be growing interest in how China’s approach may influence AI governance and regulation at a global level.
Large or “foundation” models are now being widely used to generate not just text and images but also video, music and code from prompts. Although this “generative AI” revolution is clearly driving new opportunities for innovation and creativity, it is also enabling easy and rapid dissemination of harmful speech and potentially infringing existing laws. Much attention has been paid recently to how we can draft bespoke legislation to control these risks and harms; however, private ordering by generative AI providers, via user contracts, licenses and privacy policies, has so far attracted less attention. Drawing on the extensive history of study of the terms and conditions (T&C) and privacy policies of social media companies, this paper reports the results of pilot empirical work conducted in January–March 2023, in which T&C were mapped across a representative sample of generative AI providers. With a focus on copyright and data protection, our early findings indicate the emergence of a “platformisation paradigm,” in which providers of generative AI attempt to position themselves as neutral intermediaries. This study concludes that new laws targeting “big tech” must be carefully reconsidered to avoid repeating past power imbalances between users and platforms.
This Element covers the interaction of two research areas: linguistic semantics and deep learning. It focuses on three phenomena central to natural language interpretation: reasoning and inference; compositionality; extralinguistic grounding. Representation of these phenomena in recent neural models is discussed, along with the quality of these representations and ways to evaluate them (datasets, tests, measures). The Element closes with suggestions on possible deeper interactions between theoretical semantics and language technology based on deep learning models.
Degrees, unlike entities or events, refer to comparative qualities and are closely tied to gradable adjectives such as “tall.” Degree expressions have been explored in second language (L2) research, covering areas such as learnability, first language (L1) transfer, contrastive analysis, and acquisition difficulty. However, a computational approach to the learning of degree expressions in L2 contexts, particularly for L1 Chinese learners of English, has not been thoroughly investigated. This study aims to fill this gap by utilizing natural language processing (NLP) methods, drawing insights from recent advancements in large language models (LLMs). This study extends Cong (2024)’s general-purpose assessment pipeline to specifically analyze degree expressions, predicting that surprisal metrics will correlate with proficiency levels and distinct developmental stages of L2 learners. Crucially, we address the limitations of surprisal metrics in capturing underuse or avoidance (common in L2 development) by integrating frequency-based analyses. Using an NLP pipeline developed with Stanza, we automatically identified and analyzed degree expressions, constructing linear mixed-effects models to track L2 development trajectories. Our findings reveal that as proficiency increases, learners use complex degree expressions more frequently, supporting theories linking difficulty and learnability. Higher surprisal values are associated with lower proficiency in using degree expressions, and these surprisal values are more predictive of proficiency with degree expressions than classic NLP measures. These results add further evidence that LLMs and NLP tools provide valuable insights into L2 development, specifically in the domain of degree expressions, expanding upon previous research and offering new approaches for understanding L2 learning processes.
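To make the two measurement ideas concrete, the sketch below flags candidate degree expressions from a Stanza dependency parse and computes average surprisal with a small causal language model. The degree-word list, the matching rule, and the choice of GPT-2 are assumptions for illustration, not the study's pipeline.

```python
# Illustrative sketch, not the study's pipeline: (1) flag candidate degree
# expressions from a Stanza dependency parse, (2) compute average surprisal with
# a small causal LM. The degree-word list and use of GPT-2 are demo assumptions.
import math
import stanza
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# stanza.download("en")  # first run only
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")
DEGREE_WORDS = {"very", "too", "more", "less", "so", "quite", "enough"}  # toy list

def degree_expressions(text: str):
    """Return (modifier, adjective) pairs where a degree word modifies an ADJ."""
    pairs = []
    for sent in nlp(text).sentences:
        for word in sent.words:
            head = sent.words[word.head - 1] if word.head > 0 else None
            if word.lemma in DEGREE_WORDS and head is not None and head.upos == "ADJ":
                pairs.append((word.text, head.text))
    return pairs

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def mean_surprisal(text: str) -> float:
    """Average per-token surprisal (in bits) under the causal LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = logprobs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]
    return float(-token_lp.mean() / math.log(2))

sentence = "She is far too proud of being so very tall."
print(degree_expressions(sentence), round(mean_surprisal(sentence), 2))
```

Per-learner counts of such expressions and their surprisal values could then feed the kind of mixed-effects models the study describes.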
Originating from a unique partnership between data scientists (datavaluepeople) and peacebuilders (Build Up), this commentary explores an innovative methodology to overcome key challenges in social media analysis by developing customized text classifiers through a participatory design approach, engaging both peace practitioners and data scientists. It advocates for researchers to focus on developing frameworks that prioritize being usable and participatory in field settings, rather than perfect in simulation. Focusing on a case study investigating the polarization within online Christian communities in the United States, we outline a testing process with a dataset consisting of 8,954 tweets and 10,034 Facebook posts to experiment with active learning methodologies aimed at enhancing the efficiency and accuracy of text classification. This commentary demonstrates that the inclusion of domain expertise from peace practitioners significantly refines the design and performance of text classifiers, enabling a deeper comprehension of digital conflicts. This collaborative framework seeks to transition from a data-rich, analysis-poor scenario to one where data-driven insights robustly inform peacebuilding interventions.
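A minimal sketch of the kind of active learning loop the commentary experiments with, here as plain uncertainty sampling: the classifier proposes the posts it is least certain about, and practitioners label those next. The model, features, and batch size are illustrative assumptions, not the project's actual tooling.

```python
# Minimal uncertainty-sampling loop (an assumption about the flavour of active
# learning described above, not the project's actual code): the model proposes
# the posts it is least sure about, and practitioners label those next.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def most_uncertain(model, X_pool, batch_size=20):
    """Indices of pooled posts whose top-class probability is lowest."""
    probs = model.predict_proba(X_pool)
    uncertainty = 1.0 - probs.max(axis=1)
    return np.argsort(uncertainty)[::-1][:batch_size]

def active_learning_round(labelled_texts, labels, unlabelled_texts):
    """One round: fit on current labels, return the posts to annotate next."""
    vec = TfidfVectorizer()
    X_lab = vec.fit_transform(labelled_texts)
    X_pool = vec.transform(unlabelled_texts)
    model = LogisticRegression(max_iter=1000).fit(X_lab, labels)
    to_label = most_uncertain(model, X_pool)
    # These posts go back to the peace practitioners for annotation before the
    # classifier is retrained and re-evaluated in the next round.
    return [unlabelled_texts[i] for i in to_label]
```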
Recent advances in large language models (LLMs), such as GPT-4, have spurred interest in their potential applications across various fields, including actuarial work. This paper introduces the use of LLMs in actuarial and insurance-related tasks, both as direct contributors to actuarial modelling and as workflow assistants. It provides an overview of LLM concepts and their potential applications in actuarial science and insurance, examining specific areas where LLMs can be beneficial, including a detailed assessment of the claims process. Additionally, a decision framework for determining the suitability of LLMs for specific tasks is presented. Case studies with accompanying code showcase the potential of LLMs to enhance actuarial work. Overall, the results suggest that LLMs can be valuable tools for actuarial tasks involving natural language processing or structuring unstructured data and as workflow and coding assistants. However, their use in actuarial work also presents challenges, particularly regarding professionalism and ethics, for which high-level guidance is provided.
Scientists have started to explore whether novel artificial intelligence (AI) tools based on large language models, such as GPT-4, could support the scientific peer review process. We sought to understand (i) whether AI and human reviewers are able to distinguish made-up, AI-generated conference abstracts from human-written abstracts reporting on actual research, and (ii) how the quality assessments of the reported research by AI and human reviewers correspond to each other. We conducted a large-scale field experiment during a medium-sized scientific conference, relying on 305 human-written and 20 AI-written abstracts that were reviewed either by AI or by 217 human reviewers. The results show that human reviewers and GPTZero were better at discerning (AI vs. human) authorship than GPT-4. Regarding quality assessments, there was rather low agreement between both human–human and human–AI reviewer pairs, but AI reviewers were more aligned with human reviewers in classifying the very best abstracts. This indicates that AI could become a prescreening tool for scientific abstracts. The results are discussed with regard to the future development and use of AI tools during the scientific peer review process.
Background
Attempts to use artificial intelligence (AI) in psychiatric disorders show moderate success, highlighting the potential of incorporating information from clinical assessments to improve the models. This study focuses on using large language models (LLMs) to detect suicide risk from medical text in psychiatric care.
Aims
To extract information about suicidality status from the admission notes in electronic health records (EHRs) using privacy-sensitive, locally hosted LLMs, specifically evaluating the efficacy of Llama-2 models.
Method
We compared the performance of several variants of the open-source LLM Llama-2 in extracting suicidality status from 100 psychiatric reports against a ground truth defined by human experts, assessing accuracy, sensitivity, specificity and F1 score across different prompting strategies.
Results
A German fine-tuned Llama-2 model showed the highest accuracy (87.5%), sensitivity (83.0%) and specificity (91.8%) in identifying suicidality, with significant improvements in sensitivity and specificity across various prompt designs.
Conclusions
The study demonstrates the capability of LLMs, particularly Llama-2, in accurately extracting information on suicidality from psychiatric records while preserving data privacy. This suggests that they could be applied in surveillance systems for psychiatric emergencies and could improve the clinical management of suicidality by strengthening systematic quality control and research.
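As a hedged illustration of the general approach, the sketch below prompts a locally hosted Llama-2 chat model for a binary suicidality label and derives the reported metrics from confusion-matrix counts. The checkpoint name, prompt wording, and example are assumptions, not the study's German fine-tuned model or its prompt designs, and real clinical text must of course stay within the local, privacy-preserving environment.

```python
# Hedged sketch only: prompt a locally hosted Llama-2 chat model for a binary
# suicidality label, then derive the reported metrics from a confusion matrix.
# The checkpoint name, prompt wording, and example are assumptions, not the
# study's German fine-tuned model or prompt designs.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # gated base checkpoint; fine-tunes differ
    device_map="auto",
)

def label_note(note: str) -> bool:
    prompt = (
        "Read the admission note and answer with exactly 'yes' or 'no': "
        f"does the note indicate current suicidality?\n\nNote: {note}\nAnswer:"
    )
    out = generator(prompt, max_new_tokens=3, do_sample=False)[0]["generated_text"]
    return "yes" in out[len(prompt):].lower()

def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, sensitivity, specificity and F1 from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }
```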
This paper addresses the optimization of retrieval-augmented generation (RAG) processes by exploring various methodologies, including advanced RAG methods. The research, driven by the need to enhance RAG processes as highlighted by recent studies, involved a grid-search optimization of 23,625 iterations. We evaluated multiple RAG methods across different vectorstores, embedding models, and large language models, using cross-domain datasets and contextual compression filters. The findings emphasize the importance of balancing context quality with similarity-based ranking methods, as well as understanding tradeoffs between similarity scores, token usage, runtime, and hardware utilization. Additionally, contextual compression filters were found to be crucial for efficient hardware utilization and reduced token consumption, despite the evident impacts on similarity scores, which may be acceptable depending on specific use cases and RAG methods.
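A minimal retrieval-augmented generation sketch to fix the moving parts the study varies: an embedding model, an in-memory vector store, retrieved context, and a similarity cutoff standing in crudely for a contextual compression filter. The embedding model, cutoff, and prompt template are illustrative assumptions, not the paper's grid-searched configurations.

```python
# Minimal RAG sketch; the embedding model, the similarity cutoff standing in for
# a contextual compression filter, and the prompt template are illustrative
# assumptions, not the paper's grid-searched configurations.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

documents = [
    "Contextual compression keeps only passages relevant to the query.",
    "Vector stores index document embeddings for similarity search.",
    "Token usage and runtime grow with the amount of retrieved context.",
]
doc_vecs = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2, min_sim: float = 0.3) -> list[str]:
    """Top-k passages above a similarity cutoff (a crude compression filter)."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    sims = doc_vecs @ q  # cosine similarity, since the vectors are normalized
    order = np.argsort(sims)[::-1][:k]
    return [documents[i] for i in order if sims[i] >= min_sim]

def build_prompt(query: str) -> str:
    """Augment the query with retrieved context before sending it to an LLM."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("Why do compression filters reduce token consumption?"))
```

Tightening the cutoff or lowering k trades similarity coverage for lower token usage and runtime, which is the kind of tradeoff the grid search quantifies.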
Hierarchical text classification (HTC) is a natural language processing task which aims to categorise a text document into a set of classes from a hierarchical class structure. Recent approaches to solve HTC tasks focus on leveraging pre-trained language models (PLMs) and the hierarchical class structure by allowing these components to interact in various ways. Specifically, the Hierarchy-aware Prompt Tuning (HPT) method has proven to be effective in applying the prompt tuning paradigm to Bidirectional Encoder Representations from Transformers (BERT) models for HTC tasks. Prompt tuning aims to reduce the gap between the pre-training and fine-tuning phases by transforming the downstream task into the pre-training task of the PLM. Discriminative PLMs, which use a replaced token detection (RTD) pre-training task, have also been shown to perform better on flat text classification tasks when using prompt tuning instead of vanilla fine-tuning. In this paper, we propose the Hierarchy-aware Prompt Tuning for Discriminative PLMs (HPTD) approach, which injects the HTC task into the RTD task used to pre-train discriminative PLMs. Furthermore, we make several improvements to the prompt tuning approach of discriminative PLMs that enable HTC tasks to scale to much larger hierarchical class structures. Through comprehensive experiments, we show that our method is robust and outperforms current state-of-the-art approaches on two out of three HTC benchmark datasets.
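To illustrate the replaced-token-detection idea that HPTD builds on, the sketch below scores candidate label words with an ELECTRA discriminator in a zero-shot, flat-classification setting: the label word the discriminator finds least "replaced-looking" in the template wins. This is only the untuned core of RTD-based prompting, not HPTD itself (no hierarchy injection, no prompt tuning), and the template and label words are assumptions.

```python
# Zero-shot illustration of the replaced-token-detection (RTD) idea behind
# prompting discriminative PLMs: each candidate label word is placed in a
# template and scored by the ELECTRA discriminator; the word that looks least
# "replaced" wins. No hierarchy injection or prompt tuning is shown, and the
# template and label words are assumptions.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

tok = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator").eval()

LABEL_WORDS = ["sports", "politics", "science"]  # hypothetical flat classes

def classify(text: str) -> str:
    scores = {}
    for word in LABEL_WORDS:
        enc = tok(f"{text} This text is about {word}.", return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits[0]  # per-token "was this replaced?" scores
        # Average the score at the label-word position(s); lower = better fit.
        word_ids = set(tok(word, add_special_tokens=False).input_ids)
        positions = [i for i, t in enumerate(enc.input_ids[0].tolist()) if t in word_ids]
        scores[word] = logits[positions].mean().item()
    return min(scores, key=scores.get)

print(classify("The striker scored twice in the second half."))
```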
The English Preposing in PP construction (PiPP; e.g., Happy though/as we were) is extremely rare but displays an intricate set of stable syntactic properties. How do people become proficient with this construction despite such limited evidence? It is tempting to posit innate learning mechanisms, but present-day large language models seem to learn to represent PiPPs as well, even though such models employ only very general learning mechanisms and experience very few instances of the construction during training. This suggests an alternative hypothesis on which knowledge of more frequent constructions helps shape knowledge of PiPPs. I seek to make this idea precise using model-theoretic syntax (MTS). In MTS, a grammar is essentially a set of constraints on forms. In this context, PiPPs can be seen as arising from a mix of construction-specific and general-purpose constraints, all of which seem inferable from general linguistic experience.
Growth in the complexity of advanced systems is mirrored by growth in the number of engineering requirements and related upstream and downstream tasks. These requirements are typically expressed in natural language and require human expertise to manage. Natural language processing (NLP) technology has long been seen as promising for increasing requirements engineering (RE) productivity but has yet to demonstrate substantive benefits. The recent addition of large language models (LLMs) to the NLP toolbox is now generating renewed enthusiasm in the hope that it will overcome past shortcomings. This article scrutinizes this claim by reviewing the application of LLMs to engineering requirements tasks. We survey the success of applying LLMs and the scale at which they have been used. We also identify groups of challenges shared across different engineering requirement tasks. These challenges show that the way this technology has been applied to RE tasks needs reassessment. We conclude by drawing a parallel to other engineering fields that have faced similar challenges and how those were overcome in the past, and we suggest these as future directions to be investigated.
Machine learning, especially neural network methods, is increasingly important in network analysis. This chapter will discuss the theoretical aspects of network embedding methods and graph neural networks. As we have seen, much of the success of advanced machine learning is thanks to useful representations (embeddings) of data. Embedding and machine learning are closely aligned. Translating network elements to embedding vectors and sending those vectors as features to a predictive model often leads to a simpler, more performant model than trying to work directly with the network. Embeddings help with network learning tasks, from node classification to link prediction. We can even embed entire networks and then use models to summarize and compare networks. Not only does machine learning benefit from embeddings; embeddings also benefit from machine learning. Inspired by the incredible recent progress with natural language data, embeddings created by predictive models are becoming more useful and important. Often these embeddings are produced by neural networks of various flavors, and we explore current approaches for using neural networks on network data.
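A toy illustration of the embed-then-predict pattern described here: embed the nodes of a graph and hand the vectors to an ordinary classifier. The karate-club graph, spectral embedding, and labels are assumptions for the demo; a graph neural network would instead learn the embedding and the predictor jointly.

```python
# Toy embed-then-predict demo (assumptions: karate-club graph, spectral
# embedding, and a logistic-regression node classifier; a GNN would learn the
# embedding and the predictor jointly instead).
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import SpectralEmbedding
from sklearn.model_selection import train_test_split

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)  # adjacency matrix used as the affinity matrix
X = SpectralEmbedding(n_components=4, affinity="precomputed").fit_transform(A)
y = np.array([G.nodes[n]["club"] == "Officer" for n in G.nodes])  # node labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("node-classification accuracy:", clf.score(X_te, y_te))
```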
Background:
Incarceration is a significant social determinant of health, contributing to high morbidity, mortality, and racialized health inequities. However, incarceration status is largely invisible to health services research due to inadequate clinical electronic health record (EHR) capture. This study aims to develop, train, and validate natural language processing (NLP) techniques to more effectively identify incarceration status in the EHR.
Methods:
The study population consisted of adult patients (≥ 18 years old) who presented to the emergency department between June 2013 and August 2021. The EHR database was filtered for notes containing specific incarceration-related terms, and a random selection of 1,000 notes was then annotated for incarceration and further stratified into the specific statuses of prior history, recent, and current incarceration. For NLP model development, 80% of the notes were used to train the Longformer-based and RoBERTa-based models. The remaining 20% of the notes underwent analysis with GPT-4.
Results:
There were 849 unique patients across 989 visits in the 1,000 annotated notes. Manual annotation revealed that 559 of 1,000 notes (55.9%) contained evidence of incarceration history. ICD-10 codes (sensitivity: 4.8%, specificity: 99.1%, F1-score: 0.09) demonstrated inferior performance to RoBERTa NLP (sensitivity: 78.6%, specificity: 73.3%, F1-score: 0.79), Longformer NLP (sensitivity: 94.6%, specificity: 87.5%, F1-score: 0.93), and GPT-4 (sensitivity: 100%, specificity: 61.1%, F1-score: 0.86).
Conclusions:
Our advanced NLP models demonstrate a high degree of accuracy in identifying incarceration status from clinical notes. Further research is needed to explore their scaled implementation in population health initiatives and assess their potential to mitigate health disparities through tailored system interventions.
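As a hedged sketch of the note-level fine-tuning step described in the Methods above, the snippet below fine-tunes a RoBERTa classifier on invented placeholder notes; the hyperparameters and data handling are illustrative, not the study's configuration, and Longformer would substitute for notes exceeding RoBERTa's 512-token limit.

```python
# Hedged sketch of note-level fine-tuning with placeholder data and settings
# (not the study's configuration, and no real clinical text). Longformer would
# replace roberta-base for notes longer than 512 tokens.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

train = Dataset.from_dict({
    "text": ["Patient reports a prior stay at the county jail.",  # invented examples
             "No relevant social history documented."],
    "label": [1, 0],  # 1 = evidence of incarceration history
})
train = train.map(lambda ex: tok(ex["text"], truncation=True, padding="max_length",
                                 max_length=512), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="incarceration-clf", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train,
)
trainer.train()
```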
This article examines the idea of mind-reading technology by focusing on an interesting case of applying a large language model (LLM) to brain data. On the face of it, experimental results appear to show that it is possible to reconstruct mental contents directly from brain data by processing via a ChatGPT-like LLM. However, the author argues that this apparent conclusion is not warranted. Through examining how LLMs work, it is shown that they are importantly different from natural language. The former operate through nonrational data transformations derived from a large textual corpus. The latter has a rational dimension, being based on reasons. Using this as a basis, it is argued that brain data does not directly reveal mental content, but can be processed to ground indirect predictions about mental content. The author concludes that this is impressive but different in principle from technology-mediated mind reading. The applications of LLM-based brain data processing are nevertheless promising for speech rehabilitation or novel communication methods.
To what extent can statistical language knowledge account for the effects of world knowledge in language comprehension? We address this question by focusing on a core aspect of language understanding: pronoun resolution. While existing studies suggest that comprehenders use world knowledge to resolve pronouns, the distributional hypothesis and its operationalization in large language models (LLMs) provide an alternative account of how purely linguistic information could drive apparent world knowledge effects. We addressed these confounds in two experiments. In Experiment 1, we found a strong effect of world knowledge plausibility (measured using a norming study) on responses to comprehension questions that probed pronoun interpretation. In Experiment 2, participants were slower to read continuations that contradicted world knowledge-consistent interpretations of a pronoun, implying that comprehenders deploy world knowledge spontaneously. Both effects persisted when controlling for the predictions of GPT-3, an LLM, suggesting that pronoun interpretation is at least partly driven by knowledge about the world and not the word. We propose two potential mechanisms by which knowledge-driven pronoun resolution occurs, based on validation- and expectation-driven discourse processes. The results suggest that while distributional information may capture some aspects of world knowledge, human comprehenders likely draw on other sources unavailable to LLMs.