This enthusiastic introduction to the fundamentals of information theory builds from classical Shannon theory through to modern applications in statistical learning, equipping students with a uniquely well-rounded and rigorous foundation for further study. It introduces core topics such as data compression, channel coding, and rate-distortion theory using a unique finite block-length approach. With over 210 end-of-part exercises and numerous examples, the book introduces students to contemporary applications in statistics, machine learning, and modern communication theory. This textbook presents information-theoretic methods with applications in statistical learning and computer science, such as f-divergences, PAC-Bayes and the variational principle, Kolmogorov's metric entropy, strong data processing inequalities, and entropic upper bounds for statistical estimation. Accompanied by a solutions manual for instructors and additional standalone chapters on more specialized topics in information theory, this is the ideal introductory textbook for senior undergraduate and graduate students in electrical engineering, statistics, and computer science.
Super-resolution of turbulence is a term used to describe the prediction of high-resolution snapshots of a flow from coarse-grained observations. This is typically accomplished with a deep neural network, and training usually requires a dataset of high-resolution images. An approach is presented here in which robust super-resolution can be performed without access to high-resolution reference data, as might be expected in an experiment. The training procedure is similar to data assimilation, wherein the model learns to predict an initial condition that leads to accurate coarse-grained predictions at later times, while only being shown coarse-grained observations. Implementation of the approach requires the use of a fully differentiable flow solver in the training loop to allow for time-marching of predictions. A range of models are trained on data generated from forced, two-dimensional turbulence. The networks have reconstruction errors similar to those obtained with ‘standard’ super-resolution approaches that use high-resolution data. Furthermore, the methods are comparable in performance to standard data assimilation for state estimation on individual trajectories, outperforming these variational approaches at the initial time and remaining robust when unrolled in time, where the performance of the standard data-assimilation algorithm improves.
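As a rough illustration of this training strategy, the sketch below (an assumption-laden toy, not the authors' code) uses PyTorch with a placeholder diffusion step standing in for the differentiable flow solver: a network proposes a high-resolution initial condition from a coarse observation, the state is time-marched, and the loss compares coarse-grained predictions against later coarse observations only.

```python
# Minimal sketch (assumptions: PyTorch; a toy diffusion step stands in for the
# differentiable flow solver; shapes, names, and hyperparameters are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

def coarse_grain(x, factor=4):
    """Box-filter a high-resolution field down to the observation grid."""
    return F.avg_pool2d(x, kernel_size=factor)

def toy_solver_step(x, dt=0.1):
    """Placeholder for one step of a differentiable flow solver (here: diffusion)."""
    lap = (torch.roll(x, 1, -1) + torch.roll(x, -1, -1)
           + torch.roll(x, 1, -2) + torch.roll(x, -1, -2) - 4 * x)
    return x + dt * lap

class SuperResolver(nn.Module):
    """Maps a coarse observation to a candidate high-resolution initial condition."""
    def __init__(self, factor=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=factor, mode="bilinear", align_corners=False),
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
    def forward(self, obs_coarse):
        return self.net(obs_coarse)

model = SuperResolver()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# obs[t] are coarse-grained observations at successive times (synthetic here).
obs = [torch.randn(8, 1, 16, 16) for _ in range(4)]

for epoch in range(10):
    hi_res = model(obs[0])                 # predicted high-resolution initial condition
    loss = F.mse_loss(coarse_grain(hi_res), obs[0])
    state = hi_res
    for t in range(1, len(obs)):           # time-march through the differentiable solver
        state = toy_solver_step(state)
        loss = loss + F.mse_loss(coarse_grain(state), obs[t])
    opt.zero_grad()
    loss.backward()                        # gradients flow back through the solver
    opt.step()
```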
Rapid urbanization poses several challenges, especially when faced with an uncontrolled urban development plan. It often leads to anarchic occupation and expansion of cities, resulting in the phenomenon of urban sprawl (US). To support sustainable decision-making in urban planning and policy development, a more effective approach to addressing this issue through US simulation and prediction is essential. Despite the work published in the literature on the use of deep learning (DL) methods to simulate US indicators, almost no work has been published to assess what has already been done, the potential, the issues, and the challenges ahead. By synthesising existing research, we aim to assess the current landscape of the use of DL in modelling US. This article begins by demystifying US, focusing on its multifaceted challenges and implications. Through an examination of DL methodologies, we aim to highlight their effectiveness in capturing the complex spatial patterns and relationships associated with US. The article also examines the synergy between DL and conventional methods, highlighting their respective advantages and disadvantages. It emerges that the use of DL in the simulation and forecasting of US indicators is increasing, and its potential is very promising for guiding strategic decisions to control and mitigate this phenomenon. This is not without major challenges, both in terms of data and models and in terms of strategic city planning policies.
Risk-based surveillance is now a well-established paradigm in epidemiology, involving the distribution of sampling efforts differentially in time, space, and within populations, based on multiple risk factors. To assess and map the risk of the presence of the bacterium Xylella fastidiosa, we have compiled a dataset that includes factors influencing plant development and thus the spread of this harmful organism. To this end, we have collected, preprocessed, and gathered information and data related to land types, soil compositions, and climatic conditions to predict and assess the probability of risk associated with X. fastidiosa in relation to environmental features. This resource can be of interest to researchers conducting analyses on X. fastidiosa and, more generally, to researchers working on geospatial modeling of risk related to plant infectious diseases.
Both energy performance certificates (EPCs) and thermal infrared (TIR) images play key roles in mapping the energy performance of the urban building stock. In this paper, we developed parametric building archetypes using an EPC database and conducted temperature clustering on TIR images acquired from drones and satellite datasets. We evaluated 1,725 EPCs of existing building stock in Cambridge, UK, to generate energy consumption profiles. We processed drone-based TIR images of individual buildings in two Cambridge University colleges using a machine learning pipeline for thermal anomaly detection and investigated the influence of two specific factors that affect the reliability of TIR for energy management applications: ground sample distance (GSD) and angle of view (AOV). The EPC results suggest that the construction year of the buildings influences their energy consumption. For example, modern buildings were over 30% more energy-efficient than older ones. In parallel, older buildings were found to show almost double the energy savings potential through retrofitting compared to newly constructed buildings. TIR imaging results showed that thermal anomalies can only be properly identified in images with a GSD of 1 m/pixel or less. A GSD of 1–6 m/pixel can detect hot areas of building surfaces. We found that a GSD > 6 m/pixel cannot characterize individual buildings but does help identify urban heat island effects. Additional sensitivity analysis showed that building thermal anomaly detection is more sensitive to AOV than to GSD. Our study informs newer approaches to building energy diagnostics using thermography and supports decision-making for large-scale retrofitting.
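To illustrate the kind of temperature clustering involved, here is a minimal sketch assuming NumPy and scikit-learn, with synthetic facade temperatures rather than the paper's drone imagery; the k-means setup, the hot-cluster anomaly rule, and the 4x coarsening used to mimic a larger GSD are all illustrative assumptions.

```python
# Minimal sketch (assumptions: NumPy + scikit-learn; `tir` is a 2-D array of surface
# temperatures in degrees C from a radiometric TIR image; the thresholds and cluster
# count are illustrative, not the paper's pipeline).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
tir = 10 + 2 * rng.standard_normal((200, 200))     # synthetic facade temperatures
tir[60:80, 90:120] += 8                            # an artificial hot patch (e.g. heat loss)

# Cluster pixel temperatures into a few thermal regimes.
k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(
    tir.reshape(-1, 1)).reshape(tir.shape)

# Flag the warmest cluster as candidate thermal anomalies.
cluster_means = [tir[labels == i].mean() for i in range(k)]
anomaly_mask = labels == int(np.argmax(cluster_means))
print(f"anomalous pixels: {anomaly_mask.mean():.1%}")

# Coarsening the image mimics a larger ground sample distance (GSD): the hot patch
# blurs into its surroundings and eventually cannot be localised per building.
coarse = tir.reshape(50, 4, 50, 4).mean(axis=(1, 3))   # 4x coarser grid
```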
Machine learning (ML) techniques have emerged as a powerful tool for predicting weather and climate systems. However, much of the progress to date focuses on predicting the short-term evolution of the atmosphere. Here, we look at the potential for ML methodology to predict the evolution of the ocean. The presence of land in the domain is a key difference between ocean modeling and previous work looking at atmospheric modeling. Here, we look to train a convolutional neural network (CNN) to emulate a process-based General Circulation Model (GCM) of the ocean, in a configuration which contains land. We assess performance on predictions over the entire domain and near to the land (coastal points). Our results show that the CNN replicates the underlying GCM well when assessed over the entire domain. RMS errors over the test dataset are low in comparison to the signal being predicted, and the CNN model gives an order of magnitude improvement over a persistence forecast. When we partition the domain into near land and the ocean interior and assess performance over these two regions, we see that the model performs notably worse over the near land region. Near land, RMS scores are comparable to those from a simple persistence forecast. Our results indicate that ocean interaction with land is something the network struggles with and highlight that this may be an area where advanced ML techniques specifically designed for, or adapted for, the geosciences could bring further benefits.
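A minimal sketch of the near-land versus interior evaluation described above, assuming NumPy, a synthetic land mask, and synthetic fields standing in for the CNN and GCM output; the three-cell "near land" halo is an arbitrary choice.

```python
# Minimal sketch (assumptions: NumPy only; `pred`, `truth` and `prev` are gridded
# fields; the land mask and the near-land dilation width are illustrative).
import numpy as np

rng = np.random.default_rng(1)
ny, nx = 64, 64
land = np.zeros((ny, nx), bool)
land[:, :8] = True                          # a strip of land on the western boundary

# Mark ocean points within a few grid cells of land as "near land".
near_land = np.zeros_like(land)
for dy in range(-3, 4):
    for dx in range(-3, 4):
        near_land |= np.roll(np.roll(land, dy, 0), dx, 1)
near_land &= ~land
interior = ~land & ~near_land

truth = rng.standard_normal((ny, nx))
prev = truth + 0.1 * rng.standard_normal((ny, nx))     # previous state (persistence)
pred = truth + 0.05 * rng.standard_normal((ny, nx))    # CNN emulator output (synthetic)

def rmse(a, b, mask):
    return float(np.sqrt(np.mean((a[mask] - b[mask]) ** 2)))

print("interior : CNN", rmse(pred, truth, interior), "persistence", rmse(prev, truth, interior))
print("near land: CNN", rmse(pred, truth, near_land), "persistence", rmse(prev, truth, near_land))
```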
Atmospheric models used for weather and climate prediction are traditionally formulated in a deterministic manner. In other words, given a particular state of the resolved scale variables, the most likely forcing from the subgrid scale processes is estimated and used to predict the evolution of the large-scale flow. However, the lack of scale separation in the atmosphere means that this approach is a large source of error in forecasts. Over recent years, an alternative paradigm has developed: the use of stochastic techniques to characterize uncertainty in small-scale processes. These techniques are now widely used across weather, subseasonal, seasonal, and climate timescales. In parallel, recent years have also seen significant progress in replacing parametrization schemes using machine learning (ML). This has the potential to both speed up and improve our numerical models. However, the focus to date has largely been on deterministic approaches. In this position paper, we bring together these two key developments and discuss the potential for data-driven approaches for stochastic parametrization. We highlight early studies in this area and draw attention to the novel challenges that remain.
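As a toy illustration of the contrast between deterministic and stochastic parametrization, the sketch below perturbs a deterministic subgrid tendency with multiplicative noise, loosely in the spirit of stochastically perturbed parametrization tendency (SPPT) schemes; the tendency function, noise amplitude, and ensemble size are assumptions for illustration only.

```python
# Minimal sketch (assumptions: NumPy; the deterministic "tendency" is a toy relaxation;
# the multiplicative-noise form is a loose SPPT-style illustration, not any operational
# scheme or the data-driven approaches discussed in the paper).
import numpy as np

rng = np.random.default_rng(0)

def deterministic_tendency(state):
    """Toy subgrid forcing: relax the resolved state toward zero."""
    return -0.1 * state

def stochastic_tendency(state, sigma=0.3):
    """Perturb the deterministic estimate with multiplicative noise."""
    return deterministic_tendency(state) * (1 + sigma * rng.standard_normal(state.shape))

state = rng.standard_normal(16)
ensemble = np.stack([state + stochastic_tendency(state) for _ in range(10)])
print("ensemble spread introduced by the stochastic scheme:", ensemble.std(axis=0).mean())
```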
Climate models are biased with respect to real-world observations. They usually need to be adjusted before being used in impact studies. The suite of statistical methods that enable such adjustments is called bias correction (BC). However, BC methods currently struggle to adjust temporal biases, because they mostly disregard the dependence between consecutive time points. As a result, climate statistics with long-range temporal properties, such as the number of heatwaves and their frequency, cannot be corrected accurately. This makes it more difficult to produce reliable impact studies on such climate statistics. This article offers a novel BC methodology to correct temporal biases. This is made possible by rethinking the philosophy behind BC. We introduce BC as a time-indexed regression task with stochastic outputs. Rethinking BC in this way enables us to adapt state-of-the-art machine learning (ML) attention models and thereby learn different types of biases, including temporal asynchronicities. With a case study of the number of heatwaves in Abuja, Nigeria, and Tokyo, Japan, we show more accurate results than current climate model outputs and alternative BC methods.
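The following sketch illustrates the "time-indexed regression with stochastic outputs" framing using a generic PyTorch Transformer encoder with a Gaussian output head trained by negative log-likelihood; the synthetic warm-biased, phase-shifted series, the day-of-year features, and the architecture are illustrative assumptions, not the paper's model.

```python
# Minimal sketch (assumptions: PyTorch; synthetic daily temperature series; a generic
# Transformer encoder with a Gaussian head as a simplified stand-in for the
# attention-based BC model described in the paper).
import math
import torch
import torch.nn as nn

class StochasticBC(nn.Module):
    def __init__(self, d_model=32):
        super().__init__()
        self.inp = nn.Linear(3, d_model)     # (climate-model value, sin, cos of day-of-year)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2)    # mean and log-variance of corrected value

    def forward(self, x):
        h = self.encoder(self.inp(x))
        mu, logvar = self.head(h).unbind(-1)
        return mu, logvar

def nll(mu, logvar, y):
    """Gaussian negative log-likelihood (up to a constant)."""
    return 0.5 * (logvar + (y - mu) ** 2 / logvar.exp()).mean()

# Synthetic data: model output is a warm-biased, phase-shifted version of "observations".
t = torch.arange(365.0)
obs = 15 + 10 * torch.sin(2 * math.pi * t / 365)
sim = 17 + 10 * torch.sin(2 * math.pi * (t - 10) / 365)
doy = torch.stack([torch.sin(2 * math.pi * t / 365), torch.cos(2 * math.pi * t / 365)], -1)
x = torch.cat([sim.unsqueeze(-1), doy], -1).unsqueeze(0)   # (batch=1, time, features)

model = StochasticBC()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    mu, logvar = model(x)
    loss = nll(mu, logvar, obs.unsqueeze(0))
    opt.zero_grad(); loss.backward(); opt.step()
```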
This paper presents a machine learning approach to multidimensional item response theory (MIRT), a class of latent factor models that can be used to model and predict student performance from observed assessment data. Inspired by collaborative filtering, we define a general class of models that includes many MIRT models. We discuss the use of penalized joint maximum likelihood to estimate individual models and cross-validation to select the best performing model. This model evaluation process can be optimized using batching techniques, such that even sparse large-scale data can be analyzed efficiently. We illustrate our approach with simulated and real data, including an example from a massive open online course. The high-dimensional model fit to this large and sparse dataset does not lend itself well to traditional methods of factor interpretation. By analogy to recommender-system applications, we propose an alternative “validation” of the factor model, using auxiliary information about the popularity of items consulted during an open-book examination in the course.
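A minimal sketch of penalized joint maximum likelihood for a logistic latent factor model of this kind, assuming PyTorch, synthetic sparse binary responses, and an L2 penalty; the latent dimension, penalty weight, and optimizer settings are placeholders that would in practice be chosen by cross-validation as the paper describes.

```python
# Minimal sketch (assumptions: PyTorch; Y is a binary response matrix with missing
# entries; the L2 penalty and latent dimension K are illustrative stand-ins for the
# penalized joint maximum likelihood estimation discussed above).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, I, K = 500, 40, 3                        # students, items, latent dimensions
theta_true = torch.randn(N, K)
a_true, b_true = torch.randn(I, K), torch.randn(I)
Y = torch.bernoulli(torch.sigmoid(theta_true @ a_true.T + b_true))
mask = (torch.rand(N, I) < 0.7).float()     # ~70% of responses observed (sparse data)

theta = (0.1 * torch.randn(N, K)).requires_grad_()   # person parameters
a = (0.1 * torch.randn(I, K)).requires_grad_()       # item loadings
b = torch.zeros(I, requires_grad=True)               # item intercepts
opt = torch.optim.Adam([theta, a, b], lr=0.05)
lam = 1e-3                                  # penalty weight, to be tuned by cross-validation

for step in range(500):
    logits = theta @ a.T + b
    nll = F.binary_cross_entropy_with_logits(logits, Y, weight=mask, reduction="sum")
    penalty = lam * (theta.square().sum() + a.square().sum())
    loss = nll + penalty
    opt.zero_grad(); loss.backward(); opt.step()
```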
Utilizing technology for automated item generation is not a new idea. However, test items used in commercial testing programs or in research are still predominantly written by humans, in most cases by content experts or professional item writers. Human experts are a limited resource, and testing agencies incur high costs in the process of continuous renewal of item banks to sustain testing programs. Using algorithms instead holds the promise of providing unlimited resources for this crucial part of assessment development. The approach presented here deviates in several ways from previous attempts to solve this problem. In the past, automatic item generation relied either on generating clones of narrowly defined item types, such as those found in language-free intelligence tests (e.g., Raven’s progressive matrices), or on an extensive analysis of task components and derivation of schemata to produce items with pre-specified variability that are hoped to have predictable levels of difficulty. It is somewhat unlikely that researchers utilizing these previous approaches would look at the proposed approach with favor; however, recent applications of machine learning show success in solving tasks that seemed impossible for machines not too long ago. The proposed approach uses deep learning to implement probabilistic language models, not unlike what Google Brain and Amazon Alexa use for language processing and generation.
A Bayes estimation procedure is introduced that allows the nature and strength of prior beliefs to be easily specified and modal posterior estimates to be obtained as easily as maximum likelihood estimates. The procedure is based on constructing posterior distributions that are formally identical to likelihoods, but are based on sampled data as well as artificial data reflecting prior information. Improvements in performance of modal Bayes procedures relative to maximum likelihood estimation are illustrated for Rasch-type models. Improvements range from modest to dramatic, depending on the model and the number of items being considered.
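To make the idea concrete, here is a deliberately tiny sketch using a binomial proportion rather than a Rasch model: the prior is expressed as artificial observations, and the modal estimate is obtained by maximizing the augmented likelihood with exactly the same arithmetic as the MLE; the numbers are invented for illustration.

```python
# Minimal sketch (assumptions: a binomial proportion rather than a full Rasch model,
# used only to illustrate encoding prior beliefs as artificial data so that the
# posterior mode is obtained with the same machinery as the MLE).
successes, trials = 9, 10                 # observed data
prior_successes, prior_trials = 5, 10     # artificial data expressing a prior centred at 0.5

mle = successes / trials
posterior_mode = (successes + prior_successes) / (trials + prior_trials)
print(f"MLE = {mle:.2f}, modal Bayes estimate = {posterior_mode:.2f}")
```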
We propose a framework for identifying discrete behavioural types in experimental data. We re-analyse data from six previous studies of public goods voluntary contribution games. Using hierarchical clustering analysis, we construct a typology of behaviour based on a similarity measure between strategies. We identify four types with distinct stereotypical behaviours, which together account for about 90% of participants. Compared to the previous approaches, our method produces a classification in which different types are more clearly distinguished in terms of strategic behaviour and the resulting economic implications.
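A minimal sketch of the clustering step, assuming SciPy, synthetic contribution strategies, and Euclidean distance with Ward linkage; the paper's actual similarity measure between strategies may differ.

```python
# Minimal sketch (assumptions: SciPy; each row of `strategies` is a participant's
# contribution schedule; Euclidean distance and Ward linkage are illustrative choices,
# not necessarily the similarity measure used in the paper).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic strategies over 21 possible "others' average contribution" levels (0..20):
free_riders = np.zeros((30, 21))
conditional = np.tile(np.arange(21.0), (40, 1)) + rng.normal(0, 1, (40, 21))
strategies = np.vstack([free_riders, conditional])

Z = linkage(strategies, method="ward")          # agglomerative hierarchical clustering
types = fcluster(Z, t=4, criterion="maxclust")  # cut the dendrogram into four types
for k in np.unique(types):
    print(f"type {k}: n = {(types == k).sum()}")
```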
Word embeddings are now a vital resource for social science research. However, obtaining high-quality training data for non-English languages can be difficult, and fitting embeddings therein may be computationally expensive. In addition, social scientists typically want to make statistical comparisons and do hypothesis tests on embeddings, yet this is nontrivial with current approaches. We provide three new data resources designed to ameliorate the union of these issues: (1) a new version of fastText model embeddings, (2) a multilanguage “a la carte” (ALC) embedding version of the fastText model, and (3) a multilanguage ALC embedding version of the well-known GloVe model. All three are fit to Wikipedia corpora. These materials are aimed at “low-resource” settings where the analysts lack access to large corpora in their language of interest or to the computational resources required to produce high-quality vector representations. We make these resources available for 40 languages, along with a code pipeline for another 117 languages available from Wikipedia corpora. We extensively validate the materials via reconstruction tests and other proofs-of-concept. We also conduct human crowdworker tests for our embeddings for Arabic, French, (traditional Mandarin) Chinese, Japanese, Korean, Russian, and Spanish. Finally, we offer some advice to practitioners using our resources.
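For readers unfamiliar with "a la carte" (ALC) embeddings, the sketch below shows the basic induction step: average the pre-trained embeddings of a word's contexts and apply a learned linear transformation; the vectors and induction matrix here are random stand-ins, not the released resources.

```python
# Minimal sketch (assumptions: NumPy; `vectors` maps tokens to pre-trained embeddings
# and `A` is a pre-fitted "a la carte" induction matrix; both are random stand-ins here
# to show how an ALC embedding for a rare word is induced from its contexts).
import numpy as np

rng = np.random.default_rng(0)
dim = 50
vocab = ["the", "court", "ruled", "against", "the", "plaintiff"]
vectors = {w: rng.standard_normal(dim) for w in set(vocab)}
A = rng.standard_normal((dim, dim))          # induction matrix learned from the corpus

def alc_embedding(contexts, vectors, A):
    """Average the embeddings of context tokens, then transform with A."""
    ctx_vecs = [vectors[w] for ctx in contexts for w in ctx if w in vectors]
    return A @ np.mean(ctx_vecs, axis=0)

# Contexts in which the target word (not shown) occurred:
contexts = [["the", "court", "ruled"], ["against", "the", "plaintiff"]]
target_vec = alc_embedding(contexts, vectors, A)
```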
Peatlands, covering approximately one-third of global wetlands, provide various ecological functions but are highly vulnerable to climate change, with their changes in space and time requiring monitoring. The sub-Antarctic Prince Edward Islands (PEIs) are a key conservation area for South Africa, as well as for the preservation of terrestrial ecosystems in the region. Peatlands (mires) found here are threatened by climate change, yet their distribution factors are poorly understood. This study attempted to predict mire distribution on the PEIs using species distribution models (SDMs) employing multiple regression-based and machine-learning models. The random forest model performed best. Key influencing factors were the Normalized Difference Water Index and slope, with low annual mean temperature, precipitation seasonality and distance from the coast being less influential. Despite moderate predictive ability, the model could only identify general areas of mires, not specific ones. Therefore, this study showed limited support for the use of SDMs in predicting mire distributions on the sub-Antarctic PEIs. It is recommended to refine the criteria used to select environmental factors and enhance the geospatial resolution of the data to improve the predictive accuracy of the models.
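A minimal sketch of the kind of species distribution modelling workflow described, assuming scikit-learn and synthetic covariates named after those in the study (NDWI, slope, annual mean temperature, precipitation seasonality, distance from coast); the presence rule and all results are illustrative.

```python
# Minimal sketch (assumptions: scikit-learn; synthetic presence/absence data generated
# from a made-up rule, so the AUC and importances are illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.uniform(-1, 1, n),      # NDWI
    rng.uniform(0, 40, n),      # slope (degrees)
    rng.uniform(2, 8, n),       # annual mean temperature (deg C)
    rng.uniform(5, 40, n),      # precipitation seasonality
    rng.uniform(0, 5000, n),    # distance from coast (m)
])
# Synthetic presence rule: wetter, flatter sites are more likely to hold mires.
p = 1 / (1 + np.exp(-(2 * X[:, 0] - 0.1 * X[:, 1] + 1)))
y = rng.binomial(1, p)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
print("feature importances:", rf.feature_importances_.round(3))
```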
Developing large-eddy simulation (LES) wall models for separated flows is challenging. We propose to leverage the significance of separated flow data, for which existing theories are not applicable, and the existing knowledge of wall-bounded flows (such as the law of the wall) along with embedded learning to address this issue. The proposed features-embedded-learning (FEL) wall model comprises two submodels: one for predicting the wall shear stress and another for calculating the eddy viscosity at the first off-wall grid nodes. We train the former using wall-resolved LES (WRLES) data of the periodic hill flow and the law of the wall. For the latter, we propose a modified mixing length model, with the model coefficient trained using the ensemble Kalman method. The proposed FEL model is assessed using separated flows with different flow configurations, grid resolutions and Reynolds numbers. Overall good a posteriori performance is observed for predicting the statistics of the recirculation bubble, wall stresses and turbulence characteristics. The statistics of the modelled subgrid-scale (SGS) stresses at the first off-wall grid nodes are compared with those calculated using the WRLES data. The comparison shows that the amplitude and distribution of the SGS stresses and energy transfer obtained using the proposed model agree better with the reference data than those obtained with the conventional SGS model.
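For orientation, the sketch below shows a classical mixing-length eddy-viscosity closure with a single tunable coefficient, the kind of coefficient the paper trains with the ensemble Kalman method; the formula, wall distances, and velocity gradients are illustrative assumptions, not the FEL submodel itself.

```python
# Minimal sketch (assumptions: NumPy; a classical mixing-length eddy-viscosity formula
# with one tunable coefficient `c`, shown only to illustrate the kind of closure whose
# coefficient can be calibrated against reference data).
import numpy as np

def eddy_viscosity(y, dudy, c=0.41):
    """Mixing-length model: nu_t = (c * y)^2 * |dU/dy| at wall distance y."""
    return (c * y) ** 2 * np.abs(dudy)

y = np.array([1e-3, 5e-3, 1e-2])        # wall distances of first off-wall nodes (m)
dudy = np.array([800.0, 300.0, 120.0])  # wall-normal velocity gradients (1/s)
print(eddy_viscosity(y, dudy))
```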
A biofilm refers to an intricate community of microorganisms firmly attached to surfaces and enveloped within a self-generated extracellular matrix. Machine learning (ML) methodologies have been harnessed across diverse facets of biofilm research, encompassing predictions of biofilm formation, identification of pivotal genes and the formulation of novel therapeutic approaches. This investigation undertook a bibliographic analysis focused on ML applications in biofilm research, aiming to present a comprehensive overview of the field’s current status. Our exploration involved searching the Web of Science database for articles incorporating the term “machine learning biofilm,” leading to the identification and analysis of 126 pertinent articles. Our findings indicate a substantial upswing in the publication count concerning ML in biofilm over the last decade, underscoring an escalating interest in deploying ML techniques for biofilm investigations. The analysis further disclosed prevalent research themes, predominantly revolving around biofilm formation, prediction and control. Notably, artificial neural networks and support vector machines emerged as the most frequently employed ML techniques in biofilm research. Overall, our study furnishes valuable insights into prevailing trends and future trajectories within the realm of ML applied to biofilm research. It underscores the significance of collaborative efforts between biofilm researchers and ML experts, advocating for interdisciplinary synergy to propel innovation in this domain.
This study reveals the morphological evolution of a splashing drop by a newly proposed feature extraction method, and a subsequent interpretation of the classification of splashing and non-splashing drops performed by an explainable artificial intelligence (XAI) video classifier. Notably, the values of the weight matrix elements of the XAI that correspond to the extracted features are found to change with the temporal evolution of the drop morphology. We compute the rate of change of the contributions of each frame with respect to the classification value of a video as an importance index to quantify the contributions of the extracted features at different impact times to the classification. Remarkably, the rate computed for the extracted splashing features of ethanol and 1 cSt silicone oil is found to have a peak value at the early impact times, while the extracted features of 5 cSt silicone oil are more obvious at a later time when the lamella is more developed. This study provides an example that clarifies the complex morphological evolution of a splashing drop by interpreting the XAI.
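A loose sketch of computing per-frame contributions and an importance index for a linear video classifier, assuming NumPy and random stand-ins for the extracted features and weight-matrix elements; the finite-difference definition of the index here is a simplified reading of the quantity described in the abstract.

```python
# Minimal sketch (assumptions: NumPy; `features` are per-frame feature values extracted
# from a drop-impact video and `w` are the corresponding weight-matrix elements of a
# linear (XAI) classifier; both are random placeholders, and the "importance index" is
# simply the frame-to-frame change of each frame's contribution).
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_features = 20, 5
features = rng.standard_normal((n_frames, n_features))   # extracted features per frame
w = rng.standard_normal(n_features)                      # classifier weights

contributions = features @ w                 # contribution of each frame to the score
classification_value = contributions.sum()   # video-level classification value
importance_index = np.diff(contributions)    # rate of change across impact times
peak_frame = int(np.argmax(np.abs(importance_index))) + 1
print("frame with the largest change in contribution:", peak_frame)
```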
Anticipating future migration trends is instrumental to the development of effective policies to manage the challenges and opportunities that arise from population movements. However, anticipation is challenging. Migration is a complex system, with multifaceted drivers, such as demographic structure, economic disparities, political instability, and climate change. Measurements encompass inherent uncertainties, and the majority of migration theories are either under-specified or hardly actionable. Moreover, approaches for forecasting generally target specific migration flows, and this poses challenges for generalisation.
In this paper, we present the results of a case study to predict Irregular Border Crossings (IBCs) through the Central Mediterranean Route and Asylum requests in Italy. We applied a set of Machine Learning techniques in combination with a suite of traditional data to forecast migration flows. We then applied an ensemble modelling approach for aggregating the results of the different Machine Learning models to improve the modelling prediction capacity.
Our results show the potential of this modelling architecture in producing forecasts of IBCs and Asylum requests over 6 months. The explained variance of our models on a validation set is as high as 80%. This study offers a robust basis for the construction of timely forecasts. In the discussion, we offer a comment on how this approach could benefit migration management in the European Union at various levels of policy making.
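A minimal sketch of the ensemble aggregation idea, assuming scikit-learn, synthetic monthly covariates and targets in place of the study's data, and a simple average of three regressors as the aggregation rule; the hold-out explained-variance computation mirrors the kind of validation reported above.

```python
# Minimal sketch (assumptions: scikit-learn; the target and features are synthetic
# stand-ins for monthly irregular border crossings and traditional covariates; the
# simple average ensemble is one of many possible aggregation rules).
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 120, 6                              # ten years of monthly data, six covariates
X = rng.standard_normal((n, p))
y = 1000 + 300 * X[:, 0] - 200 * X[:, 1] + 50 * rng.standard_normal(n)

X_train, y_train = X[:-6], y[:-6]          # hold out the last 6 months
X_test, y_test = X[-6:], y[-6:]

models = [RandomForestRegressor(random_state=0),
          GradientBoostingRegressor(random_state=0),
          Ridge(alpha=1.0)]
preds = np.column_stack([m.fit(X_train, y_train).predict(X_test) for m in models])
ensemble = preds.mean(axis=1)              # aggregate the individual model forecasts

ss_res = np.sum((y_test - ensemble) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
print("explained variance on hold-out:", 1 - ss_res / ss_tot)
```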
Public procurement is a fundamental aspect of public administration. Its vast size makes its oversight and control very challenging, especially in countries where resources for these activities are limited. To support decisions and operations at public procurement oversight agencies, we developed and delivered VigIA, a data-based tool with two main components: (i) machine learning models to detect inefficiencies measured as cost overruns and delivery delays, and (ii) risk indices to detect irregularities in the procurement process. These two components cover complementary aspects of the procurement process, considering both active and passive waste, and help the oversight agencies to prioritize investigations and allocate resources. We show how the models developed shed light on specific features of the contracts to be considered and how their values signal red flags. We also highlight how these values change when the analysis focuses on specific contract types or on information available for early detection. Moreover, the models and indices developed only make use of open data and target variables generated by the procurement processes themselves, making them ideal to support continuous decisions at overseeing agencies.
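A minimal sketch of the two components, a supervised model for inefficiencies and a rule-based risk index, assuming scikit-learn and synthetic contract data; the features, the overrun-generating rule, and the single-bidder flag are illustrative assumptions, not VigIA's actual specification.

```python
# Minimal sketch (assumptions: scikit-learn; the contract features and the overrun
# label are synthetic stand-ins for open procurement data; the single-bidder risk
# index is an illustrative rule, not the tool's actual definition).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.lognormal(10, 1, n),        # contract value
    rng.integers(1, 10, n),         # number of bidders
    rng.integers(30, 720, n),       # planned duration (days)
])
# Synthetic rule: large, single-bidder, long contracts overrun more often.
p = 1 / (1 + np.exp(-(0.3 * np.log(X[:, 0]) - 0.4 * X[:, 1] + 0.002 * X[:, 2] - 2)))
y = rng.binomial(1, p)              # 1 = cost overrun

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
print("feature importances (value, bidders, duration):", clf.feature_importances_.round(3))

# A simple rule-based risk index flagging contracts with a single bidder.
risk_single_bidder = (X[:, 1] == 1).astype(int)
print("contracts flagged by the single-bidder index:", risk_single_bidder.sum())
```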
Focusing on methods for data that are ordered in time, this textbook provides a comprehensive guide to analyzing time series data using modern techniques from data science. It is specifically tailored to economics and finance applications, aiming to provide students with rigorous training. Chapters cover Bayesian approaches, nonparametric smoothing methods, machine learning, and continuous time econometrics. Theoretical and empirical exercises, concise summaries, bolded key terms, and illustrative examples are included throughout to reinforce key concepts and bolster understanding. Ancillary materials include an instructor's manual with solutions and additional exercises, PowerPoint lecture slides, and datasets. With its clear and accessible style, this textbook is an essential tool for advanced undergraduate and graduate students in economics, finance, and statistics.