Every day we interact with machine learning systems offering individualized predictions for our entertainment, social connections, purchases, or health. These involve several modalities of data, from sequences of clicks to text, images, and social interactions. This book introduces common principles and methods that underpin the design of personalized predictive models for a variety of settings and modalities. The book begins by revising 'traditional' machine learning models, focusing on adapting them to settings involving user data, then presents techniques based on advanced principles such as matrix factorization, deep learning, and generative modeling, and concludes with a detailed study of the consequences and risks of deploying personalized predictive systems. A series of case studies in domains ranging from e-commerce to health plus hands-on projects and code examples will give readers understanding and experience with large-scale real-world datasets and the ability to design models and systems for a wide range of applications.
Learning idiomatic expressions is seen as one of the most challenging stages in second-language learning because of their unpredictable meaning. A similar situation holds for their identification within natural language processing applications such as machine translation and parsing. The lack of high-quality usage samples exacerbates this challenge not only for humans but also for artificial intelligence systems. This article introduces a gamified crowdsourcing approach for collecting language learning materials for idiomatic expressions; a messaging bot is designed as an asynchronous multiplayer game for native speakers who compete with each other while providing idiomatic and nonidiomatic usage examples and rating other players’ entries. As opposed to classical crowd-processing annotation efforts in the field, for the first time in the literature, a crowd-creating and crowd-rating approach is implemented and tested for idiom corpora construction. The approach is language-independent and evaluated on two languages in comparison to traditional data preparation techniques in the field. The reaction of the crowd is monitored under different motivational means (namely, gamification affordances and monetary rewards). The results reveal that the proposed approach is powerful in collecting the targeted materials, and, although it is an explicit crowdsourcing approach, the crowd finds it entertaining and useful. The approach has been shown to have the potential to speed up the construction of idiom corpora for different natural languages to be used as second-language learning material, training data for supervised idiom identification systems, or samples for lexicographic studies.
Causation in written natural language can express a strong relationship between events and facts. Causation in the written form can be referred to as a causal relation where a cause event entails the occurrence of an effect event. A cause and effect relationship is stronger than a correlation between events, and therefore aggregated causal relations extracted from large corpora can be used in numerous applications such as question-answering and summarisation to produce results superior to those of traditional approaches. Techniques like logical consequence allow causal relations to be used in niche practical applications such as event prediction, which is useful for diverse domains such as security and finance. Until recently, the use of causal relations was a relatively unpopular technique because the causal relation extraction techniques were problematic, and the relations returned were incomplete, error-prone or simplistic. The recent adoption of language models and improved relation extractors for natural language such as Transformer-XL (Dai et al. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860) has seen a surge of research interest in the possibilities of using causal relations in practical applications. Until now, there has not been an extensive survey of the practical applications of causal relations; therefore, this survey is intended precisely to demonstrate the potential of causal relations. It is a comprehensive survey of the work on the extraction of causal relations and their applications, while also discussing the nature of causation and its representation in text.
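To make the notion of a causal relation concrete, the following is a minimal pattern-based extractor sketch in Python. It illustrates only the older, "simplistic" style of extraction the survey contrasts with modern neural approaches (such as Transformer-XL-based models); the patterns and the example sentence are invented for illustration.

```python
import re

# Two toy cue patterns; real systems use far richer inventories
# (and, increasingly, neural relation extractors).
CAUSAL_PATTERNS = [
    # "X because Y"  ->  cause = Y, effect = X
    re.compile(r"^(?P<effect>.+?)\s+because\s+(?P<cause>.+?)\.?$", re.I),
    # "X causes/caused Y"  ->  cause = X, effect = Y
    re.compile(r"^(?P<cause>.+?)\s+caus(?:es|ed)\s+(?P<effect>.+?)\.?$", re.I),
]

def extract_causal_relations(sentence):
    """Return (cause, effect) pairs matched by the simple cue patterns."""
    relations = []
    for pattern in CAUSAL_PATTERNS:
        m = pattern.match(sentence.strip())
        if m:
            relations.append((m.group("cause").strip(),
                              m.group("effect").strip()))
    return relations

print(extract_causal_relations(
    "The road was closed because heavy snow fell overnight."))
```

Such cue-phrase matchers are exactly the kind of incomplete, error-prone extractors the survey describes: they miss implicit causation entirely and misfire on non-causal uses of cue words.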
We propose an integrated deep learning model for morphological segmentation, morpheme tagging, part-of-speech (POS) tagging, and syntactic parsing onto dependencies, using cross-level contextual information flow for every word, from segments to dependencies, with an attention mechanism at horizontal flow. Our model extends the work of Nguyen and Verspoor ((2018). Proceedings of the CoNLL Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. The Association for Computational Linguistics, pp. 81–91.) on joint POS tagging and dependency parsing to also include morphological segmentation and morphological tagging. We report our results on several languages. Primary focus is agglutination in morphology, in particular Turkish morphology, for which we demonstrate improved performance compared to models trained for individual tasks. Being one of the earlier efforts in joint modeling of syntax and morphology along with dependencies, we discuss prospective guidelines for future comparison.
Authorship attribution – the computational task of identifying the author of a given text document within a set of possible candidates – has been attracting interest in Natural Language Processing research for many years. At the same time, significant advances have also been observed in the related field of author profiling, that is, the computational task of learning author demographics from text such as gender, age and others. The close relation between the two topics – both focused on gaining knowledge about the individual who wrote a piece of text – suggests that research in these fields may benefit from each other. To illustrate this, this work addresses the issue of author identification with the aid of author profiling methods, adding demographics predictions to an authorship attribution architecture that may be particularly suitable to extensions of this kind, namely, a stack of classifiers devoted to different aspects of the input text (words, characters and text distortion patterns). The enriched model is evaluated across a range of text domains, languages and author profiling estimators, and its results are shown to compare favourably to those obtained by a standard authorship attribution method that does not have access to author demographics predictions.
This study describes a Natural Language Processing (NLP) toolkit, as the first contribution of a larger project, for an under-resourced language—Urdu. In previous studies, standard NLP toolkits have been developed for English and many other languages. There is also a dire need for standard text processing tools and methods for Urdu, despite it being widely spoken in different parts of the world with a large amount of digital text being readily available. This study presents the first version of the UNLT (Urdu Natural Language Toolkit) which contains three key text processing tools required for an Urdu NLP pipeline: a word tokenizer, a sentence tokenizer, and a part-of-speech (POS) tagger. The UNLT word tokenizer employs a morpheme matching algorithm coupled with a state-of-the-art stochastic n-gram language model with back-off and smoothing characteristics for the space omission problem. The space insertion problem for compound words is tackled using a dictionary look-up technique. The UNLT sentence tokenizer is a combination of various machine learning, rule-based, regular-expression, and dictionary look-up techniques. Finally, the UNLT POS taggers are based on Hidden Markov Model and Maximum Entropy-based stochastic techniques. In addition, we have developed large gold standard training and testing data sets to improve and evaluate the performance of new techniques for Urdu word tokenization, sentence tokenization, and POS tagging. We have compared the proposed approaches with several existing methods. Our proposed UNLT, the training and testing data sets, and supporting resources are all free and publicly available for academic use.
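To illustrate the dictionary look-up idea behind handling space omission, here is a minimal greedy longest-match segmenter in Python. This is a deliberate simplification: the actual UNLT tokenizer couples morpheme matching with an n-gram language model with back-off and smoothing, and the toy Latin-script lexicon and input below are invented stand-ins.

```python
def max_match_segment(text, lexicon):
    """Greedy longest-match segmentation: repeatedly take the longest
    lexicon entry that is a prefix of the remaining text; fall back to
    a single character when nothing in the lexicon matches."""
    longest = max((len(w) for w in lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + longest), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# Toy lexicon and unsegmented string (invented for illustration).
lexicon = {"in", "the", "then", "there"}
print(max_match_segment("inthere", lexicon))
```

Greedy matching alone is ambiguous ("inthere" could also segment as "in/the/re"); this is precisely why UNLT scores candidate segmentations with a stochastic language model rather than relying on dictionary look-up alone.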
Machine Reading Comprehension (MRC) is a challenging task and hot topic in Natural Language Processing. The goal of this field is to develop systems that answer questions about a given context. In this paper, we present a comprehensive survey on diverse aspects of MRC systems, including their approaches, structures, input/outputs, and research novelties. We illustrate the recent trends in this field based on a review of 241 papers published during 2016–2020. Our investigation demonstrated that the focus of research has changed in recent years from answer extraction to answer generation, from single- to multi-document reading comprehension, and from learning from scratch to using pre-trained word vectors. Moreover, we discuss the popular datasets and the evaluation metrics in this field. The paper ends with an investigation of the most-cited papers and their contributions.
Property inference involves predicting properties for a word from its distributional representation. We focus on human-generated resources that link words to their properties and on the task of predicting these properties for unseen words. We introduce the use of label propagation, a semi-supervised machine learning approach, for this task and, in the first systematic study of models for this task, find that label propagation achieves state-of-the-art results. For more variety in the kinds of properties tested, we introduce two new property datasets.
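As a concrete illustration of label propagation, the semi-supervised method the abstract introduces for property inference, here is a minimal self-contained sketch in Python. The graph, seed labels, and hyperparameters are invented for illustration; in the property-inference setting the nodes would be words, edge weights would come from distributional similarity, and labels would be semantic properties.

```python
import numpy as np

def label_propagation(W, labels, iters=100, alpha=0.9):
    """Semi-supervised label propagation on a weighted graph.
    W: (n, n) symmetric similarity matrix.
    labels: per-node class id, with -1 marking unlabeled nodes.
    Iteratively spreads label mass along edges while re-clamping the
    labeled seeds, then returns the predicted class id per node."""
    classes = sorted(set(labels) - {-1})
    n = len(labels)
    Y = np.zeros((n, len(classes)))
    for i, l in enumerate(labels):
        if l != -1:
            Y[i, classes.index(l)] = 1.0
    D_inv = np.diag(1.0 / np.maximum(W.sum(axis=1), 1e-12))
    P = D_inv @ W  # row-normalised transition matrix
    F = Y.copy()
    for _ in range(iters):
        F = alpha * (P @ F) + (1 - alpha) * Y  # spread, then clamp seeds
    return [classes[j] for j in F.argmax(axis=1)]

# Chain graph 0-1-2-3: the two middle nodes are unlabeled and inherit
# the label of the nearer seed.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
labels = [0, -1, -1, 1]
print(label_propagation(W, labels))
```

The appeal for property inference is visible even in this toy case: unlabeled words receive properties from their distributional neighbours without any parametric model being trained.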
Funding for AI start-ups in general is booming, and natural language processing as a subfield has not missed out. We take a closer look at early-stage funding over the last year—just over US$1B in total—for companies that offer solutions that are based on or make significant use of NLP, providing a picture of what funders think is innovative and bankable in this space, and we make some observations on notable trends and developments.
Stories are typically represented as a set of events and temporal or causal relations among events. In the metro map model of storylines, participants are represented as histories and events as interactions between participant histories. The metro map model calls for a decomposition of events into what each participant does (or what happens to each participant), as well as the interactions among participants. Such a decompositional model of events has been developed in linguistic semantics. Here, we describe this decompositional model of events and how it can be combined with a metro map model of storylines.
This chapter reviews the research conducted on the representation of events, from the perspectives of natural language processing, artificial intelligence (AI), and linguistics. AI approaches to modeling change have traditionally focused on situations and state descriptions. Linguistic approaches start with the description of the propositional content of sentences (or natural language expressions generally). As a result, the focus in the two fields has been on different problems. I argue that these approaches have common elements that can be drawn on to view event semantics from a unifying perspective, where we can distinguish between the surface events denoted by verbal predicates and what I refer to as the latent event structure of a sentence. By clearly distinguishing between surface and latent event structures of sentences and texts, we move closer to a general computational theory of event structure, one permitting a common vocabulary for events and the relations between them, while enabling reasoning at multiple levels of interpretation.
Understanding the timeline of a story is a necessary first step for extracting storylines. This is difficult because timelines are rarely explicitly given in documents, and fragments of a story may be found across multiple documents. We outline prior work and the state of the art in both timeline extraction and alignment of events across documents. Previous work focused mainly on temporal graph extraction rather than actual timelines. Recently, there has been a growing interest in extracting timelines from these graphs. We review this work and describe our own approach that solves timeline extraction exactly. With regard to event alignment, most efforts have focused on the specific task of cross-document event coreference (CDEC). Current approaches to CDEC perform either event-only clustering or joint event–entity clustering, with neural methods achieving the best results. We outline next steps to advance the field toward full timeline alignment across documents that can serve as a foundation for extraction of higher-level, more abstract storylines.
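To show what turning a temporal graph into an actual timeline involves in the simplest case, here is a Python sketch that totally orders events from pairwise BEFORE relations via topological sort (Kahn's algorithm). This is an illustration only, not the exact solver the chapter describes, and the events and relations below are invented; real temporal graphs also carry relation types beyond BEFORE, partial orders, and uncertainty.

```python
from collections import defaultdict, deque

def extract_timeline(events, before_edges):
    """Order events into a single timeline from pairwise BEFORE
    relations using Kahn's topological sort. Raises ValueError on a
    cycle, i.e. a temporally inconsistent graph."""
    succ = defaultdict(list)
    indeg = {e: 0 for e in events}
    for a, b in before_edges:  # interpreted as: a BEFORE b
        succ[a].append(b)
        indeg[b] += 1
    queue = deque(sorted(e for e in events if indeg[e] == 0))
    timeline = []
    while queue:
        e = queue.popleft()
        timeline.append(e)
        for nxt in sorted(succ[e]):
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                queue.append(nxt)
    if len(timeline) != len(events):
        raise ValueError("cyclic temporal graph: no consistent timeline")
    return timeline

events = ["arrest", "crime", "trial", "verdict"]
edges = [("crime", "arrest"), ("arrest", "trial"), ("trial", "verdict")]
print(extract_timeline(events, edges))
```

The hard parts the chapter surveys begin where this sketch ends: the BEFORE relations must first be extracted from text, are typically incomplete (yielding only a partial order), and must be aligned across documents.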
Event extraction aims to find who did what to whom, when, and where from unstructured data. Over the past decade, event extraction has made advances in three waves. The first wave relied on supervised machine learning models trained from a large amount of manually annotated data and crafted features. The second wave introduced deep neural networks with distributional semantic embedding features but still required large annotated data sets. This chapter provides an overview of a third wave with a share-and-transfer framework, which enhances the portability of event extraction by transferring knowledge from a high-resource setting to another low-resource setting, reducing the need for annotated data. The first share step is to construct a common structured semantic representation space into which these complex structures can be encoded. Then, in the transfer step, we train event extractors over these representations in high-resource settings and apply the learned extractors to target data in the low-resource setting. We conclude with a summary of the current status of this new framework and point to remaining challenges and future research directions to address them.
Traditional event detection systems typically extract structured information on events by matching predefined event templates through slot filling. Automatic linking of related event templates extracted from different documents over a longer period of time is of paramount importance for analysts to facilitate situational monitoring and manage the information overload and other long-term data aggregation tasks. This chapter reports on exploring the usability of various machine learning techniques, textual, and metadata features to train classifiers for automatically linking related event templates from online news. In particular, we focus on linking security-related events, including natural and man-made disasters, social and political unrest, military actions and crimes. With the best models trained on a moderate-size corpus (ca. 22,000 event pairs) that use solely textual features, one could achieve an F1 score of 93.6%. This figure is further improved to 96.7% by inclusion of event metadata features, mainly thanks to the strong discriminatory power of automatically extracted geographical information related to events.
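The template-linking setup can be sketched as a pairwise classification over hand-crafted features. The Python toy below combines one textual feature (lexical overlap of slot values) with two metadata features (event type and location agreement) in a fixed linear scorer; this stands in for the trained classifiers the chapter evaluates, and the template fields, weights, and threshold are all invented for illustration.

```python
def template_features(t1, t2):
    """Features for a pair of event templates: lexical overlap of slot
    values plus agreement on metadata (event type and location)."""
    w1 = set(" ".join(t1["slots"].values()).lower().split())
    w2 = set(" ".join(t2["slots"].values()).lower().split())
    jaccard = len(w1 & w2) / max(len(w1 | w2), 1)
    return {
        "jaccard": jaccard,
        "same_type": float(t1["type"] == t2["type"]),
        "same_location": float(t1.get("location") == t2.get("location")),
    }

def linked(t1, t2, threshold=1.5):
    """Toy unweighted linear scorer standing in for a trained model."""
    f = template_features(t1, t2)
    return f["jaccard"] + f["same_type"] + f["same_location"] >= threshold

# Two hypothetical templates describing the same flood event.
a = {"type": "flood", "location": "Prague",
     "slots": {"what": "river floods city centre", "when": "Monday"}}
b = {"type": "flood", "location": "Prague",
     "slots": {"what": "city centre flooded by river", "when": "Monday"}}
print(linked(a, b))
```

The chapter's headline result maps directly onto this decomposition: textual features alone reach 93.6% F1, and adding metadata features, notably extracted geography (the `same_location` slot here), lifts it to 96.7%.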
Witness testimony provides the first draft of history and requires a kind of reading that connects descriptions of events from many perspectives and sources. This chapter examines one critical step in that connective process, namely, how to assess a speaker's certainty about the events they describe. By surveying a group of approximately 300 readers and their approximately 28,000 decisions about speaker certainty, this chapter explores how readers may think about factual and counterfactual statements, and how they interpret the certainty with which a witness makes their statements. Ultimately, this chapter argues that readers of collections of witness testimony were more likely to agree about event descriptions when those providing the description were certain, and that readers' abilities to accept gradations of certainty were better when a witness described factual, rather than counterfactual or negated events. These findings lead to a suggestion for how researchers in natural language processing could better model the question of speaker certainty, at least when dealing with the kind of narrative nonfiction one finds in witness testimony.