Students will develop a practical understanding of data science with this hands-on textbook for introductory courses. This new edition is fully revised and updated, with numerous exercises and examples in the popular data science tool Python, a new chapter on using Python for statistical analysis, and a new chapter that demonstrates how to use Python within a range of cloud platforms. The many practice examples, drawn from real-life applications, range from small to big data and come to life in a new end-to-end project in Chapter 11. New 'Data Science in Practice' boxes highlight how concepts introduced work within an industry context and many chapters include new sections on AI and Generative AI. A suite of online material for instructors provides a strong supplement to the book, including lecture slides, solutions, additional assessment material and curriculum suggestions. Datasets and code are available for students online. This entry-level textbook is ideal for readers from a range of disciplines wishing to build a practical, working knowledge of data science.
Students will develop a practical understanding of data science with this hands-on textbook for introductory courses. This new edition is fully revised and updated, with numerous exercises and examples in the popular data science tool R, a new chapter on using R for statistical analysis, and a new chapter that demonstrates how to use R within a range of cloud platforms. The many practice examples, drawn from real-life applications, range from small to big data and come to life in a new end-to-end project in Chapter 11. New 'Data Science in Practice' boxes highlight how concepts introduced work within an industry context and many chapters include new sections on AI and Generative AI. A suite of online material for instructors provides a strong supplement to the book, including lecture slides, solutions, additional assessment material and curriculum suggestions. Datasets and code are available for students online. This entry-level textbook is ideal for readers from a range of disciplines wishing to build a practical, working knowledge of data science.
We study the reconstruction of turbulent flow fields from trajectory data recorded by actively migrating Lagrangian agents. We propose a deep-learning model, track-to-flow (T2F), which employs a vision transformer as the encoder to capture the spatiotemporal features of a single agent trajectory, and a convolutional neural network as the decoder to reconstruct the flow field. To enhance the physical consistency of the T2F model, we further incorporate a physics-informed loss function inspired by the physics-informed neural network (PINN) framework, yielding a variant model referred to as T2F+PINN. We first evaluate both models in a laminar cylinder wake flow at a Reynolds number of $\textit{Re} = 800$ as a proof of concept. The results show that the T2F model achieves velocity reconstruction accuracy comparable to that of existing flow reconstruction methods, while the T2F+PINN model reduces the normalised error in vorticity reconstruction relative to the T2F model. We then apply the models in turbulent Rayleigh–Bénard convection at a Rayleigh number of $Ra = 10^{8}$ and a Prandtl number of $\textit{Pr} = 0.71$. The results show that the T2F model accurately reconstructs both the velocity and temperature fields, whereas the T2F+PINN model further improves the reconstruction accuracy of gradient-related physical quantities, such as temperature gradients, vorticity and the $Q$ value, with a maximum improvement of approximately 60 % compared to the T2F model. Overall, the T2F model is better suited for reconstructing primitive flow variables, while the T2F+PINN model provides advantages in reconstructing gradient-related quantities. Our models open a promising avenue for accurate flow reconstruction from a single Lagrangian trajectory.
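The abstract does not spell out the physics-informed loss, but the general PINN idea can be sketched: add a penalty on the residual of a governing equation to the ordinary data misfit. Below is a minimal, hypothetical illustration in numpy using the 2-D incompressible continuity equation with finite differences; the actual T2F+PINN loss, networks and flow fields are not reproduced here.

```python
import numpy as np

def continuity_residual(u, v, dx, dy):
    """PINN-style physics residual du/dx + dv/dy for a 2-D incompressible
    field, via finite differences. (Illustrative stand-in only; the paper's
    actual residual terms are not specified in the abstract.)"""
    du_dx = np.gradient(u, dx, axis=1)   # x varies along columns
    dv_dy = np.gradient(v, dy, axis=0)   # y varies along rows
    return du_dx + dv_dy

def physics_informed_loss(u_pred, v_pred, u_true, v_true, dx, dy, lam=0.1):
    """Data misfit plus a weighted penalty on the continuity residual."""
    data_loss = np.mean((u_pred - u_true) ** 2 + (v_pred - v_true) ** 2)
    residual = continuity_residual(u_pred, v_pred, dx, dy)
    return data_loss + lam * np.mean(residual ** 2)

# A divergence-free test field: u = sin(x)cos(y), v = -cos(x)sin(y).
x = np.linspace(0, np.pi, 64)
y = np.linspace(0, np.pi, 64)
X, Y = np.meshgrid(x, y)                 # rows index y, columns index x
u = np.sin(X) * np.cos(Y)
v = -np.cos(X) * np.sin(Y)
dx, dy = x[1] - x[0], y[1] - y[0]
print(np.abs(continuity_residual(u, v, dx, dy)).max())  # near zero
```

For this analytically divergence-free field the residual penalty is (up to discretisation error) zero, so the physics term only activates when a reconstruction violates the governing equation.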
Emphasizing how and why machine learning algorithms work, this introductory textbook bridges the gap between the theoretical foundations of machine learning and its practical algorithmic and code-level implementation. Over 85 thorough worked examples, in both Matlab and Python, demonstrate how algorithms are implemented and applied whilst illustrating the end result. Over 75 end-of-chapter problems empower students to develop their own code to implement these algorithms, equipping them with hands-on experience. Matlab coding examples demonstrate how a mathematical idea is converted from equations to code, and provide a jumping-off point for students, supported by in-depth coverage of essential mathematics including multivariable calculus, linear algebra, probability and statistics, numerical methods, and optimization. Accompanied online by instructor lecture slides, downloadable Python code and additional appendices, this is an excellent introduction to machine learning for senior undergraduate and graduate students in Engineering and Computer Science.
This work proposes a data-driven explicit algebraic stress-based detached-eddy simulation (DES) method. Despite the widespread use of data-driven methods in model development for both Reynolds-averaged Navier–Stokes (RANS) and large-eddy simulations (LES), their applications to DES remain limited. The challenge mainly lies in the absence of modelled stress data, the requirement for proper length scales in RANS and LES branches, and the maintenance of a reasonable switching behaviour. The data-driven DES method is constructed based on the algebraic stress equation. The control of RANS/LES switching is achieved through the eddy viscosity in the linear part of the modelled stress, under the $\ell ^2-\omega$ DES framework. Three model coefficients associated with the pressure–strain terms and the LES length scale are represented by a neural network as functions of scalar invariants of velocity gradient. The neural network is trained using velocity data with the ensemble Kalman method, thereby circumventing the requirement for modelled stress data. Moreover, the baseline coefficient values are incorporated as additional reference data to ensure reasonable switching behaviour. The proposed approach is evaluated on two challenging turbulent flows, i.e. the secondary flow in a square duct and the separated flow over a bump. The trained model achieves significant improvements in predicting mean flow statistics compared with the baseline model. This is attributed to improved predictions of the modelled stress. The trained model also exhibits reasonable switching behaviour, enlarging the LES region to resolve more turbulent structures. Furthermore, the model shows satisfactory generalisation capabilities for both cases in similar flow configurations.
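As a rough illustration of the training idea, the sketch below applies an ensemble Kalman update to a toy one-parameter inverse problem: parameters are nudged to match observed outputs using only ensemble statistics, with no derivatives of the forward model, which is what lets such methods dispense with modelled-stress data. The forward map, noise level and iteration count are all illustrative choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def enkf_update(thetas, preds, y_obs, r):
    """One ensemble Kalman update of the parameter ensemble.
    thetas: (M, p) parameters; preds: (M, d) model outputs per member;
    y_obs: (d,) observations; r: observation noise variance."""
    M, d = preds.shape
    A = thetas - thetas.mean(axis=0)            # parameter anomalies
    B = preds - preds.mean(axis=0)              # prediction anomalies
    C_ty = A.T @ B / (M - 1)                    # cross-covariance (p, d)
    C_yy = B.T @ B / (M - 1) + r * np.eye(d)    # output covariance (d, d)
    K = C_ty @ np.linalg.inv(C_yy)              # Kalman gain (p, d)
    y_pert = y_obs + rng.normal(0.0, np.sqrt(r), size=(M, d))
    return thetas + (y_pert - preds) @ K.T      # derivative-free update

# Toy inverse problem: recover theta* = 2 from observing [theta, theta**2].
# (The forward model stands in for running the simulation; illustrative only.)
forward = lambda th: np.column_stack([th, th ** 2])
thetas = rng.normal(1.0, 0.5, size=(100, 1))    # prior ensemble
y_obs = np.array([2.0, 4.0])                    # data generated at theta* = 2
for _ in range(5):
    thetas = enkf_update(thetas, forward(thetas[:, 0]), y_obs, r=1e-2)
print(round(float(thetas.mean()), 2))
```

The same update applies unchanged when `thetas` holds neural-network weights and `preds` holds simulated velocities, which is the shape of the problem the paper describes.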
Poor food intake remains one of the most common challenges among older adults in the UK, with at least 10% in community settings and up to 45% in care homes affected by malnutrition. Malnutrition is strongly associated with frailty and with functional and health decline. Tracking and understanding the impact of diet is not easy: dietary monitoring and malnutrition screening are hampered by difficulty remembering, lack of time, and the need for a dietician to interpret results. Computerised tailored education may offer a solution to these issues, and with the rise in smartphone ownership the use of technology to monitor diet is becoming more popular. This review examines the problems with current methods of dietary monitoring, particularly in older adults, and presents the benefits and barriers of using technology to monitor food intake. It discusses how a photo food monitoring app was developed to address the current issues with technology and how it was tested with older adults living in community and care settings. The prototype was co-developed, incorporated automated food classification to monitor dietary intake and food preferences, and was tested with older adults. It proved usable by both older adults and care workers, and feedback on how to improve it was collected. Key design improvements to make it quicker and more accurate were suggested for future testing in this population. With adaptations, this prototype could benefit older adults living in both community and care settings.
Researchers classify political parties into families by their shared cleavage origins. However, as parties have drifted from their original ideological commitments, it is unclear to what extent party families today can function as effective heuristics for shared positions. As one solution to this challenge, we propose an alternative way of classifying parties based solely on their ideological positions. We use model-based clustering to recast common subjective decisions involved in the process of creating party groups as problems of model selection, thus providing non-subjective criteria to define ideological clusters. By comparing canonical families to our ideological clusters, we show that while party families on the right are often too similar to justify categorizing them into different clusters, left-wing families are weakly internally cohesive. Moreover, we identify two clusters predominantly composed of parties in Eastern Europe, questioning the degree to which categories originally designed to describe Western Europe can generalize to other regions.
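The model-selection step can be illustrated with a standard Gaussian mixture: fit models with different numbers of clusters and let an information criterion, rather than a subjective choice, pick the count. This toy sketch (synthetic two-dimensional "party positions", scikit-learn's `GaussianMixture`) illustrates only the principle, not the authors' actual specification.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical party positions on two ideological dimensions, drawn from
# three latent blocs (invented data, for illustration only).
positions = np.vstack([
    rng.normal(loc=[-2.0, -2.0], scale=0.4, size=(40, 2)),  # bloc A
    rng.normal(loc=[0.0, 2.0], scale=0.4, size=(40, 2)),    # bloc B
    rng.normal(loc=[2.5, -1.0], scale=0.4, size=(40, 2)),   # bloc C
])

# Model-based clustering: the number of clusters is chosen by BIC,
# a non-subjective model-selection criterion, not by the analyst.
bics = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, random_state=0).fit(positions)
    bics[k] = gm.bic(positions)
best_k = min(bics, key=bics.get)
print(best_k)
```

For this well-separated toy data the criterion recovers the three generating blocs; with real party positions the interesting cases are exactly those where BIC merges or splits the canonical families.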
How does the language of male and female politicians differ when they communicate directly with the public on social media? Do citizens address them differently? We apply Lasso logistic regression models to identify the linguistic features that most differentiate the language used by or addressed to male and female Spanish politicians. Male politicians use more words related to politics, sports, ideology and infrastructure, while female politicians talk about gender and social affairs. The choice of emojis varies greatly across genders. In a novel analysis of tweets written by citizens, we find evidence of gender‐specific insults, and note that mentions of physical appearance and infantilising words are disproportionately found in text addressed to female politicians. The results suggest that politicians conform to gender stereotypes online and reveal ways in which citizens treat politicians differently depending on their gender.
Pater's (2019) target article builds a persuasive case for establishing stronger ties between theoretical linguistics and connectionism (deep learning). This commentary extends his arguments to semantics, focusing in particular on issues of learning, compositionality, and lexical meaning.
This paper charts the rapid rise of data science methodologies in manuscripts published in top journals for third sector scholarship, indicating their growing importance to research in the field. We draw on critical quantitative theory (QuantCrit) to challenge the assumed neutrality of data science insights that are especially prone to misrepresentation and unbalanced treatment of sub-groups (i.e., those marginalized and minoritized because of their race, gender, etc.). We summarize a set of challenges that result in biases within machine learning methods that are increasingly deployed in scientific inquiry. As a means of proactively addressing these concerns, we introduce the “Wells-Du Bois Protocol,” a tool that scholars can use to determine if their research achieves a baseline level of bias mitigation. Ultimately, this work aims to facilitate the diffusion of key insights from the field of QuantCrit by showing how new computational methodologies can be improved by coupling quantitative work with humanistic and reflexive approaches to inquiry. The protocol ultimately aims to help safeguard third sector scholarship from systematic biases that can be introduced through the adoption of machine learning methods.
Media plays a major role in molding US public opinion about Muslims. This paper assesses the effect of the 9/11 events on the US media's framing of the Muslim nonprofit sector. Overall, it finds that the press was more likely to represent the Muslim nonprofit sector negatively post-9/11. However, the post-9/11 media framing of Muslim nonprofits was mixed. While the media was more likely to associate Muslim nonprofits with terrorism, it was also more likely to represent them as organizations that faced persecution through Islamophobia, government scrutiny, or hate attacks. These media frames may have contributed to public perceptions that Muslim organizations support terrorism while also raising the alarm amongst various stakeholders that the government and the general public are persecuting the Muslim nonprofit sector.
Scholars have discovered remarkable inequalities in who gets represented in electoral democracies. Around the world, the preferences of the rich tend to be better represented than those of the less well‐off. In this paper, we use the most comprehensive comparative dataset of unequal representation available to answer why the poor are underrepresented. By leveraging variation over time and across countries, we study which factors explain why representation is more unequal in some places than in others. We compile a number of covariates examined in previous studies and use machine learning to describe which mechanisms best explain the data. Globally, we find that economic conditions and good governance are most important in determining the extent of unequal representation, and we find little support for hypotheses related to political institutions, interest groups or political behaviour, such as turnout. These results provide the first broadly comparative explanations for unequal representation.
National Taxonomy of Exempt Entities (NTEE) codes have become the primary classifier of nonprofit missions since they were developed in the mid-1980s in response to growing demands for a taxonomy of nonprofit activities (Herman in Nonprofit and Voluntary Sector Quarterly 19(3):293–306, 1990, Barman in Social Science History 37:103–141, 2013). However, the increasingly complex nature of nonprofits means that NTEE codes may be outdated or lack specificity. As an alternative, scholars and practitioners can create a bespoke taxonomy for a specific purpose by hand-coding a training dataset and using machine learning classifiers to apply the codes to a large population. This paper presents a framework for determining training set sizes needed to scale custom taxonomies using machine learning algorithms.
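The framework's core empirical object is a learning curve: hold out a test set, train classifiers on progressively larger hand-coded samples, and look for the point where accuracy plateaus. A schematic version with synthetic data (text embeddings replaced by `make_classification` features; sizes and model are arbitrary choices) might look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for embedded nonprofit mission statements.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.25,
                                                  random_state=0)

# Grow the hand-coded training set and watch held-out accuracy; the point
# where the curve plateaus suggests how many labelled examples a custom
# taxonomy needs before machine scaling.
sizes = [50, 100, 200, 400, 800, 1500]
scores = []
for n in sizes:
    clf = LogisticRegression(max_iter=1000).fit(X_pool[:n], y_pool[:n])
    scores.append(clf.score(X_test, y_test))
for n, s in zip(sizes, scores):
    print(f"n={n:5d}  accuracy={s:.3f}")
```

In practice each point would be averaged over resamples, and the stopping size chosen where the marginal gain from further hand-coding falls below its cost.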
The aim of this study was to determine a soil quality index (SQI) for hazelnut orchards managed under organic and conventional agricultural systems, and to evaluate the predictability of soil quality using the XGBoost algorithm. To determine soil quality, a multi-criteria decision-making process was applied to the total dataset (TDS) using standard scoring functions (linear and non-linear), and the minimum dataset (MDS) was obtained using principal component analysis (PCA). Model verification was then performed using SQI and yield data. According to the results, although SQI values in conventional agriculture were statistically significantly higher, the correlation between yield and SQI was higher for soils under organic agriculture than under conventional agriculture. The SQI averaged 0.4576 in conventionally farmed soils and 0.4417 in organically farmed areas. RMSE values obtained for SQI estimation with the XGBoost algorithm using basic soil properties ranged from 0.038 to 0.065, with a mean error rate of approximately 8%. Lin’s concordance correlation coefficients for the SQI estimated by MDS and TDS were 0.60 and 0.61, respectively. The most effective basic soil properties for estimating SQI with the XGBoost algorithm were N, K, OM, and P. It was concluded that the XGBoost algorithm can be used for soil quality prediction. In addition, the spatial distribution patterns of the predicted and observed values were similar. The exclusive use of soil analyses can be considered a limiting factor for the model; more comprehensive studies are planned using reflectance measurements from remote sensing technologies.
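A dependency-light sketch of the prediction step, with scikit-learn's `GradientBoostingRegressor` standing in for XGBoost and a synthetic SQI generated from the four properties the study found most effective (N, K, OM, P); the data and functional form are invented purely for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400
# Synthetic basic soil properties, named after the abstract's top predictors.
N_ = rng.uniform(0.05, 0.4, n)    # total nitrogen (%)
K = rng.uniform(50, 400, n)       # available potassium
OM = rng.uniform(1.0, 5.0, n)     # organic matter (%)
P = rng.uniform(5, 60, n)         # available phosphorus
X = np.column_stack([N_, K, OM, P])

# Hypothetical SQI in [0, 1]; the true functional form is unknown, so this
# linear combination plus noise is an assumption for the sketch.
sqi = 0.1 + 0.5 * N_ + 0.0005 * K + 0.05 * OM + 0.003 * P
sqi += rng.normal(0.0, 0.02, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, sqi, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
rmse = float(np.sqrt(np.mean((model.predict(X_te) - y_te) ** 2)))
print(f"RMSE = {rmse:.3f}")
print("importances (N, K, OM, P):", model.feature_importances_.round(2))
```

The feature-importance vector is the gradient-boosting analogue of the study's ranking of effective soil properties; with real data one would also report a concordance coefficient against observed SQI.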
Designed for educators, researchers, and policymakers, this insightful book equips readers with practical strategies, critical perspectives, and ethical insights into integrating AI in education. First published in Swedish in 2023, and here translated, updated, and adapted for an English-speaking international audience, it provides a user-friendly guide to the digital and AI-related challenges and opportunities in today's education systems. Drawing upon cutting-edge research, Thomas Nygren outlines how technology can be usefully integrated into education, not as a replacement for humans, but as a tool that supports and reinforces students' learning. Written in accessible language, topics covered include AI literacy, source awareness, and subject-specific opportunities. The central role of the teacher is emphasized throughout, as is the importance of thoughtful engagement with technology. By guiding the reader through the fast-evolving digital transformation in education globally, it ultimately enables students to become informed participants in the digital world.
This research paper addresses the hypothesis that sequence-based long short-term memory (LSTM) architectures improve the prediction of the next DO (days open) relative to a feed-forward multi-layer perceptron and a Cox model under strictly temporally valid predictors. Modern dairy farming can heavily benefit from optimising ‘days open’ for profitability and animal welfare. Machine learning can forecast this metric, improving farm management, disease prevention and culling decisions. This study used a dataset of 16,472 breeding records. The study compared the performance of feed-forward neural networks and two types of recurrent neural networks (RNNs). The results showed that LSTM most accurately forecasted the next ‘days open’. This demonstrates that RNN models, due to their ability to capture temporal patterns in the data, significantly outperform feed-forward and traditional statistical methods in terms of mean absolute error and concordance.
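The temporal-pattern claim rests on the LSTM's gating mechanism, which can be written out directly. The numpy sketch below runs one randomly initialised LSTM cell over a short record sequence; it illustrates the gating equations only and is not the paper's trained model (dimensions and weights are arbitrary).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4H, D) input weights, U: (4H, H) recurrent
    weights, b: (4H,) bias; gates stacked as [input, forget, output, cell]."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[:H])            # input gate: what to write
    f = sigmoid(z[H:2*H])         # forget gate: what to keep
    o = sigmoid(z[2*H:3*H])       # output gate: what to expose
    g = np.tanh(z[3*H:])          # candidate cell update
    c_new = f * c + i * g         # cell state carries temporal information
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 3, 5                       # e.g. 3 features per breeding record
W = rng.normal(0, 0.1, (4 * H, D))
U = rng.normal(0, 0.1, (4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
sequence = rng.normal(size=(6, D))    # one cow's sequence of 6 records
for x in sequence:
    h, c = lstm_step(x, h, c, W, U, b)
print(h.round(3))                     # hidden state summarising the sequence
```

The final hidden state would feed a regression head predicting the next days-open value; a feed-forward network sees each record in isolation and has no such carried state, which is the mechanism behind the reported accuracy gap.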
Ultra-processed foods (UPFs), defined using frameworks such as NOVA, are increasingly linked to adverse health outcomes, driving interest in ways to identify and monitor their consumption. Artificial intelligence (AI) offers potential, yet its application in classifying UPFs remains underexamined. To address this gap, we conducted a scoping review mapping how AI has been used, focusing on techniques, input data, classification frameworks, accuracy, and application. Studies were eligible if they were peer-reviewed, published in English (2015–2025), and applied AI approaches to assess or classify UPFs using recognised or study-specific frameworks. A systematic search in May 2025 across PubMed, Scopus, Medline, and CINAHL identified 954 unique records, with eight ultimately meeting the inclusion criteria; one additional study was added in October following an updated search after peer review. Records were independently screened and extracted by two reviewers. Extracted data covered AI methods, input types, frameworks, outputs, validation, and context. Studies used diverse techniques, including random forest classifiers, large language models, and rule-based systems, applied across various contexts. Four studies explored practical settings: two assessed consumption or purchasing behaviours, and two developed substitution tools for healthier options. All relied on NOVA or modified versions to categorise processing. Several studies reported predictive accuracy, with F1 scores from 0.86 to 0.98, while another showed alignment between clusters and NOVA categories. Findings highlight the potential of AI tools to improve dietary monitoring and the need for further development of real-time methods and validation to support public health.
Accurately modelling wind turbine wakes is essential for optimising wind farm performance but remains a persistent challenge. While the dynamic wake meandering (DWM) model captures unsteady wake behaviour, it suffers from near-wake inaccuracies due to empirical closures. We propose a symbolic regression-enhanced DWM (SRDWM) framework that achieves equation-level closure by embedding symbolic expressions for volumetric forcing and boundary terms explicitly into governing equations. These physically consistent expressions are discovered from large-eddy simulation (LES) data using symbolic regression guided by a hierarchical, domain-informed decomposition strategy. A revised wake-added turbulence formulation is further introduced to enhance turbulence intensity predictions. Extensive verification across varying inflows shows that SRDWM accurately reproduces both mean wake characteristics and turbulent dynamics, achieving full spatiotemporal resolution with over three orders of magnitude speed-up compared to LES. The results highlight symbolic regression as a bridge between data and physics, enabling interpretable and generalisable modelling.
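Symbolic regression libraries search large expression spaces, but the core idea, choosing the closed-form expression that best fits the data, can be shown with a tiny hand-enumerated candidate library. Everything below (the candidate forms, the Gaussian "wake deficit" target) is invented for illustration and is not the paper's discovered closure.

```python
import numpy as np

# Candidate symbolic forms for a 1-D wake-deficit-like profile, each with a
# single free amplitude a (hypothetical library, for illustration).
candidates = {
    "a*exp(-x**2)": lambda x, a: a * np.exp(-x ** 2),
    "a/(1+x**2)":   lambda x, a: a / (1 + x ** 2),
    "a*x":          lambda x, a: a * x,
}

def fit_amplitude(f, x, y):
    """Least-squares amplitude for a model of the form a * phi(x),
    returning (a, mean squared error)."""
    phi = f(x, 1.0)
    a = float(phi @ y / (phi @ phi))
    return a, float(np.mean((f(x, a) - y) ** 2))

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)
y = 0.8 * np.exp(-x ** 2) + rng.normal(0, 0.01, x.size)  # synthetic target

results = {name: fit_amplitude(f, x, y) for name, f in candidates.items()}
best = min(results, key=lambda k: results[k][1])   # lowest-error expression
print(best, results[best])
```

A genetic-programming search does the same scoring over expressions it generates and mutates, and would additionally penalise expression complexity; the selected formula is then embedded back into the governing equations, which is what makes the closure interpretable.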
With the rapid growth of scholarly literature, efficient artificial intelligence (AI)–aided abstract screening tools are becoming increasingly important. This study evaluated 10 different machine learning (ML) algorithms used in AI-aided screening tools for ordering abstracts according to their estimated relevance. We focused on assessing their performance in terms of the number of abstracts required to screen to achieve a sufficient detection rate of relevant articles. Our evaluation included articles screened with diverse inclusion and exclusion criteria. Crucially, we examined how characteristics of the screening data, such as the proportion of relevant articles, the overall number of abstracts, and the amount of training data, impacted algorithm effectiveness. Our findings provide valuable insights for researchers across disciplines, highlighting key factors to consider when selecting an ML algorithm and determining a stopping point for AI-aided screening. Specifically, we observed that the algorithm combining the logistic regression (LR) classifier with the sentence-bidirectional encoder representations from transformers (SBERT) feature extractor outperformed other algorithms, demonstrating both the highest efficiency and the lowest variability in performance. Nonetheless, the algorithm’s performance varied across experimental conditions. Building on these findings, we discuss the results and provide practical recommendations to assist users in the AI-aided screening process.
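The screening-prioritisation loop the study evaluates can be sketched as: train on the abstracts labelled so far, rank the remainder by predicted relevance, read the top batch, and repeat. In this toy version SBERT embeddings are replaced by synthetic `make_classification` features, and the seed-set and batch sizes are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for abstract features (the study used SBERT embeddings;
# these random features are purely illustrative). ~10% relevant.
X, y = make_classification(n_samples=1000, n_features=30, weights=[0.9, 0.1],
                           random_state=0)

# Seed set: a few labelled relevant and irrelevant abstracts.
labelled = list(np.where(y == 1)[0][:5]) + list(np.where(y == 0)[0][:45])

# Prioritised screening: train, rank unscreened abstracts by predicted
# relevance, "read" the top batch, retrain with the new labels.
for _ in range(10):
    clf = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])
    rest = np.setdiff1d(np.arange(len(y)), labelled)
    scores = clf.predict_proba(X[rest])[:, 1]
    batch = rest[np.argsort(scores)[::-1][:25]]   # 25 most promising
    labelled.extend(batch.tolist())

screened = len(labelled)
found = int(y[labelled].sum())
total_relevant = int(y.sum())
print(f"screened {screened}/{len(y)}; found {found}/{total_relevant} relevant")
```

Because relevant abstracts surface early in the ordering, most of them are found after screening a fraction of the pool; the study's stopping-point question is where in this curve a reviewer can safely halt.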
The OPTN draws on a variety of expertise in designing organ allocation rules. Expertise arises from both explicit and tacit knowledge. Explicit knowledge includes generally accepted theories and empirical regularities that are accessible without first-hand experience of practice in some domain of knowledge. Tacit knowledge arises from experience, such as professional practice. In addition to this contributory tacit knowledge, it may also arise through interaction among participants in some domain of knowledge. Through its committee system, the OPTN taps the contributory knowledge of practitioners and patients and creates interactional tacit knowledge, especially among committee staff. Explicit knowledge arises from analysis of near universal longitudinal data on transplant candidates and other data collected within the transplantation system. These data support predictions of policy outcomes through simulation models and optimization tools utilizing machine learning.