Students will develop a practical understanding of data science with this hands-on textbook for introductory courses. This new edition is fully revised and updated, with numerous exercises and examples in the popular data science tool Python, a new chapter on using Python for statistical analysis, and a new chapter that demonstrates how to use Python within a range of cloud platforms. The many practice examples, drawn from real-life applications, range from small to big data and come to life in a new end-to-end project in Chapter 11. New 'Data Science in Practice' boxes highlight how concepts introduced work within an industry context and many chapters include new sections on AI and Generative AI. A suite of online material for instructors provides a strong supplement to the book, including lecture slides, solutions, additional assessment material and curriculum suggestions. Datasets and code are available for students online. This entry-level textbook is ideal for readers from a range of disciplines wishing to build a practical, working knowledge of data science.
Students will develop a practical understanding of data science with this hands-on textbook for introductory courses. This new edition is fully revised and updated, with numerous exercises and examples in the popular data science tool R, a new chapter on using R for statistical analysis, and a new chapter that demonstrates how to use R within a range of cloud platforms. The many practice examples, drawn from real-life applications, range from small to big data and come to life in a new end-to-end project in Chapter 11. New 'Data Science in Practice' boxes highlight how concepts introduced work within an industry context and many chapters include new sections on AI and Generative AI. A suite of online material for instructors provides a strong supplement to the book, including lecture slides, solutions, additional assessment material and curriculum suggestions. Datasets and code are available for students online. This entry-level textbook is ideal for readers from a range of disciplines wishing to build a practical, working knowledge of data science.
Emphasizing how and why machine learning algorithms work, this introductory textbook bridges the gap between the theoretical foundations of machine learning and its practical algorithmic and code-level implementation. Over 85 thorough worked examples, in both Matlab and Python, demonstrate how algorithms are implemented and applied whilst illustrating the end result. Over 75 end-of-chapter problems empower students to develop their own code to implement these algorithms, equipping them with hands-on experience. Matlab coding examples demonstrate how a mathematical idea is converted from equations to code, and provide a jumping off point for students, supported by in-depth coverage of essential mathematics including multivariable calculus, linear algebra, probability and statistics, numerical methods, and optimization. Accompanied online by instructor lecture slides, downloadable Python code and additional appendices, this is an excellent introduction to machine learning for senior undergraduate and graduate students in Engineering and Computer Science.
The OPTN draws on a variety of expertise in designing organ allocation rules. Expertise arises from both explicit and tacit knowledge. Explicit knowledge includes generally accepted theories and empirical regularities that are accessible without first-hand experience of practice in some domain of knowledge. Tacit knowledge arises from experience, such as professional practice. In addition to this contributory tacit knowledge, it may also arise through interaction among participants in some domain of knowledge. Through its committee system, the OPTN taps the contributory knowledge of practitioners and patients and creates interactional tacit knowledge, especially among committee staff. Explicit knowledge arises from analysis of near universal longitudinal data on transplant candidates and other data collected within the transplantation system. These data support predictions of policy outcomes through simulation models and optimization tools utilizing machine learning.
This computational modelling work investigates whether different rhetorical sections as subgenres of postgraduate English academic texts can be characterised by distinct types and amounts of syntactic structures. A corpus of dissertations written by students with different English language backgrounds and academic contexts was subjected to various Natural Language Processing (NLP) methods. Using a novel analytical method on linguistic data, this study identifies strong syntactic predictors of genres with the robust statistical modelling of ensemble learning. This method consists of four machine learning classifiers (Random Forest, K-Nearest Neighbors, a deep learning artificial neural network, and Gradient Boosting) as the stacked layer, with the Naive Bayes method as the meta-learner. The discussion of findings examines the extent of variability among the rhetorical sections of MA dissertations regarding the type and distribution of coordination, subordination, phrasal complexity, as well as the length of syntactic structures.
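The stacked ensemble described above can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes scikit-learn is available, uses synthetic data in place of the syntactic features, and every hyperparameter and variable name is hypothetical. The small multilayer perceptron only stands in for the deep learning network.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for per-section syntactic features (not the study's data).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Four base classifiers form the stacked layer.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("mlp", MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
]

# Naive Bayes is the meta-learner: it is trained on the cross-validated
# predictions of the base classifiers rather than on the raw features.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=GaussianNB(),
    cv=5,
)
stack.fit(X_train, y_train)
test_accuracy = stack.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Training the meta-learner on cross-validated predictions (`cv=5`) keeps it from seeing base-model outputs on the same rows those models were fitted on.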
We present a critical survey on the consistency of uncertainty quantification used in deep learning and highlight partial uncertainty coverage and many inconsistencies. We then provide a comprehensive and statistically consistent framework for uncertainty quantification in deep learning that accounts for all major sources of uncertainty: input data, training and testing data, neural network weights, and machine-learning model imperfections, targeting regression problems. We systematically quantify each source by applying Bayes’ theorem and conditional probability densities and introduce a fast, practical implementation method. We demonstrate its effectiveness on a simple regression problem and a real-world application: predicting cloud autoconversion rates using a neural network trained on aircraft measurements from the Azores and guided by a two-moment bin model of the stochastic collection equation. In this application, uncertainty from the training and testing data dominates, followed by input data, neural network model, and weight variability. Finally, we highlight the practical advantages of this methodology, showing that explicitly modeling training data uncertainty improves robustness to new inputs that fall outside the training data, and enhances model reliability in real-world scenarios.
Sentiment analysis and stance detection are key tasks in text analysis, with applications ranging from understanding political opinions to tracking policy positions. Recent advances in large language models (LLMs) offer significant potential to enhance sentiment analysis techniques and to evolve them into the more nuanced task of detecting stances expressed toward specific subjects. In this study, we evaluate lexicon-based models, supervised models, and LLMs for stance detection using two corpuses of social media data—a large corpus of tweets posted by members of the U.S. Congress on Twitter and a smaller sample of tweets from general users—which both focus on opinions concerning presidential candidates during the 2020 election. We consider several fine-tuning strategies to improve performance—including cross-target tuning using an assumption of congressmembers’ stance based on party affiliation—and strategies for fine-tuning LLMs, including few shot and chain-of-thought prompting. Our findings demonstrate that: 1) LLMs can distinguish stance on a specific target even when multiple subjects are mentioned, 2) tuning leads to notable improvements over pretrained models, 3) cross-target tuning can provide a viable alternative to in-target tuning in some settings, and 4) complex prompting strategies lead to improvements over pretrained models but underperform tuning approaches.
Social scientists have quickly adopted large language models (LLMs) for their ability to annotate documents without supervised training, an ability known as zero-shot classification. However, due to their computational demands, cost, and often proprietary nature, these models are frequently at odds with open science standards. This article introduces the Political Domain Enhanced BERT-based Algorithm for Textual Entailment (DEBATE) language models: Foundation models for zero-shot, few-shot, and supervised classification of political documents. As zero-shot classifiers, the models are designed to be used for common, well-defined tasks, such as topic and opinion classification. When used in this context, the DEBATE models are not only as good as state-of-the-art LLMs at zero-shot classification, but are orders of magnitude more efficient and completely open source. We further demonstrate that the models are effective few-shot learners. With a simple random sample of 10–25 documents, they can outperform supervised classifiers trained on hundreds or thousands of documents and state-of-the-art generative models. Additionally, we release the PolNLI dataset used to train these models—a corpus of over 200,000 political documents with highly accurate labels across over 800 classification tasks.
Build a firm foundation for studying statistical modelling, data science, and machine learning with this practical introduction to statistics, written with chemical engineers in mind. It introduces a data–model–decision approach to applying statistical methods to real-world chemical engineering challenges, establishes links between statistics, probability, linear algebra, calculus, and optimization, and covers classical and modern topics such as uncertainty quantification, risk modelling, and decision-making under uncertainty. Over 100 worked examples using Matlab and Python demonstrate how to apply theory to practice, with over 70 end-of-chapter problems to reinforce student learning, and key topics are introduced using a modular structure, which supports learning at a range of paces and levels. Requiring only a basic understanding of calculus and linear algebra, this textbook is the ideal introduction for undergraduate students in chemical engineering, and a valuable preparatory text for advanced courses in data science and machine learning with chemical engineering applications.
This article examines the governance challenges of human genomic data sharing. The analysis builds upon the unique characteristics that distinguish genomic data from other forms of personal data, particularly its dual nature as both uniquely identifiable to individuals and inherently collective, reflecting familial and ethnic group characteristics. This duality informs a tripartite risk taxonomy: individual privacy violations, group-level harms, and bioterrorism threats. Examining regulatory frameworks in the European Union (EU) and China, the article demonstrates how current data protection mechanisms—primarily anonymisation and informed consent—prove inadequate for genomic data governance due to the impossibility of true anonymisation and the limitations of consent-based models in addressing the risks of such sharing. Drawing on the concept of “genomic contextualism,” the article proposes a nuanced framework that incorporates interest balancing, comprehensive data lifecycle management, and tailored technical safeguards. The objective is to protect individuals and underrepresented groups while maximising the scientific and clinical value of genomic data.
Machine learning (ML) models show promise in predicting post-traumatic stress disorder (PTSD) treatment outcomes, but it is unknown how their predictions compare to those of clinicians. This study directly compared the accuracy of clinicians’ predictions of patient treatment outcomes with those of three ML models.
Methods
Twenty clinicians providing cognitive processing therapy repeatedly predicted outcomes for 194 veterans. We compared their accuracy against three ML models on two key endpoints: clinically meaningful symptom reduction (≥10-point PCL-5 decrease) and posttreatment severity (final PCL-5 < 33). Clinician predictions were compared against a recurrent neural network, a mixed-effects random forest, and a generalized linear mixed-effects model. We analyzed prediction accuracy and the association between clinician confidence and accuracy using logistic mixed-effects models.
Results
ML models were significantly more accurate than clinicians at predicting whether a patient’s posttreatment PCL-5 score would be below 33 (p < .001). However, no significant difference in accuracy was found for predicting a ≥10-point symptom reduction (p = .734). Clinician confidence increased throughout treatment and was significantly associated with greater prediction accuracy for both outcomes (ORs = 1.06, ps < .001).
Conclusions
ML models can outperform clinicians in predicting posttreatment symptom severity, particularly early in treatment, suggesting they could be a useful tool for identifying patients at risk for suboptimal outcomes. However, ML models were not superior in predicting symptom reduction, where clinicians also performed at a high level. Findings support the selective use of ML to enhance, rather than replace, clinical judgment in PTSD treatment.
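The two endpoints defined in the Methods can be derived from baseline and final PCL-5 scores as in the sketch below. The thresholds come from the abstract; the function names, the data class, and the example scores are hypothetical, and the study's mixed-effects analysis is not reproduced here.

```python
from dataclasses import dataclass

# Thresholds taken from the abstract.
MEANINGFUL_DROP = 10   # clinically meaningful reduction: >= 10-point decrease
SEVERITY_CUTOFF = 33   # posttreatment severity endpoint: final PCL-5 < 33


@dataclass
class Endpoints:
    meaningful_reduction: bool
    below_cutoff: bool


def pcl5_endpoints(baseline: int, final: int) -> Endpoints:
    """Derive the two binary endpoints from baseline and final PCL-5 scores."""
    return Endpoints(
        meaningful_reduction=(baseline - final) >= MEANINGFUL_DROP,
        below_cutoff=final < SEVERITY_CUTOFF,
    )


def accuracy(predicted: list[bool], observed: list[bool]) -> float:
    """Fraction of predictions that match the observed outcome."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


# Made-up scores for illustration only.
print(pcl5_endpoints(baseline=50, final=38))
print(accuracy([True, False, True], [True, True, True]))
```

A patient can meet one endpoint without the other, which is why the two are scored separately.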
Recently, data-driven methods have shown great promise for discovering governing equations from simulation or experimental data. However, most existing approaches are limited to scalar equations, with few capable of identifying tensor relationships. In this work, we propose a general data-driven framework for identifying tensor equations, referred to as symbolic identification of tensor equations (SITE). The core idea of SITE – representing tensor equations using a host–plasmid structure – is inspired by the multidimensional gene expression programming approach. To improve the robustness of the evolutionary process, SITE adopts a genetic information retention strategy. Moreover, SITE introduces two key innovations beyond conventional evolutionary algorithms. First, it incorporates a dimensional homogeneity check to restrict the search space and eliminate physically invalid expressions. Second, it replaces traditional linear scaling with a tensor linear regression technique, greatly enhancing the efficiency of numerical coefficient optimization. We validate SITE using two benchmark scenarios, where it accurately recovers target equations from synthetic data, showing robustness to noise and flexible expressive capability. Furthermore, SITE is applied to identify constitutive relations directly from molecular simulation data, which are generated without reliance on macroscopic constitutive models. It adapts to both compressible and incompressible flow conditions and successfully identifies the corresponding macroscopic forms, highlighting its potential for data-driven discovery of tensor equations.
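A dimensional homogeneity check of the kind mentioned above can be illustrated with a small sketch. This is not SITE's implementation: it handles only scalar expression trees, ignores tensor rank, and the symbol table, tuple encoding, and function name are all assumptions made for illustration.

```python
from typing import Optional

# Exponents of (mass, length, time) for each quantity.
Dim = tuple[int, int, int]

# Hypothetical symbol table for the example.
DIMS = {
    "velocity": (0, 1, -1),
    "time": (0, 0, 1),
    "length": (0, 1, 0),
}


def dimension(expr, dims=DIMS) -> Optional[Dim]:
    """Return the dimension of expr, or None if it is not homogeneous.

    expr is either a symbol name or a tuple (op, left, right).
    """
    if isinstance(expr, str):
        return dims[expr]
    op, left, right = expr
    dl, dr = dimension(left, dims), dimension(right, dims)
    if dl is None or dr is None:
        return None  # an invalid subexpression invalidates the whole tree
    if op in ("+", "-"):
        # Addition and subtraction require identical dimensions.
        return dl if dl == dr else None
    if op == "*":
        return tuple(a + b for a, b in zip(dl, dr))
    if op == "/":
        return tuple(a - b for a, b in zip(dl, dr))
    raise ValueError(f"unknown operator: {op}")


valid = ("+", "length", ("*", "velocity", "time"))   # length + velocity * time
invalid = ("+", "length", "velocity")                # length + velocity
print(dimension(valid))
print(dimension(invalid))
```

In an evolutionary search, candidates for which such a check returns `None` can be discarded before any coefficient fitting, which is how the check restricts the search space.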
This paper aims to elucidate the physical mechanisms underlying airfoil–vortex gust interaction and mitigation. The vortex gust mitigation problem consists in finding the pitch rate sequence that minimises the gust-induced lift disturbance of an NACA0012 airfoil at Reynolds number 1000. The instantaneous flow fields and resulting lift are obtained from numerical resolution of the Navier–Stokes equations. The controller is modelled as an artificial neural network and trained to minimise the lift fluctuation using deep reinforcement learning (DRL). The paper shows that DRL-trained controllers are able to mitigate medium- and high-intensity vortex gusts by more than 80 % compared to the uncontrolled scenario. It then presents a comparative analysis of the controlled and uncontrolled lift generation mechanisms using the force partitioning method (FPM). The FPM provides a quantitative assessment of the amount of lift generated by each flow region. For medium-intensity gusts, the main phenomenon is the asymmetry in the airfoil boundary layer induced by the vortex. The control strategy mitigates the gust-induced lift by restoring the flow symmetry around the airfoil. For high-intensity gusts, the boundary layer asymmetry remains, but the gust interaction with the airfoil also triggers flow separation and the formation of a strong leading-edge vortex (LEV). Consequently, the control command balances several aerodynamic phenomena such as boundary layer asymmetry, flow detachment, LEV, and secondary recirculation regions to produce a net quasi-zero lift fluctuation. Thus this work highlights the potential of DRL control, enhanced by advanced post-processing such as FPM, to discover and interpret optimal flow control mechanisms.
Eating disorders, particularly anorexia nervosa and bulimia nervosa, are significant global health challenges.
Aims
This study analyses historical trends and forecasts future patterns of eating disorders among young adults aged 15–29 years using machine learning techniques.
Method
Global data on anorexia nervosa and bulimia nervosa from the Global Burden of Disease study 2021 spanning 1990 to 2021 were analysed, examining incidence, prevalence and disability-adjusted life years (DALYs) across age groups, sociodemographic index (SDI) levels and regions. Eight machine-learning models were employed to forecast trends from 2022 to 2050.
Results
Bulimia nervosa showed more pronounced increases compared to anorexia nervosa across all metrics. The 15–19 age group had the highest incidence rates, while the 20–24 age group showed the highest prevalence and DALY rates. Low SDI regions experienced substantial increases, with bulimia nervosa prevalence rising by 179.05%. East Asia demonstrated the most significant rise in age-standardised rates. The Prophet model best forecast anorexia nervosa trends, while ARIMA performed best for bulimia nervosa. Projections indicate continued increases through 2050 for both disorders.
Conclusions
The global burden of eating disorders among young adults is projected to increase significantly by 2050, with bulimia nervosa showing more rapid growth than anorexia nervosa. Substantial variations exist across age groups, SDI levels and regions. These findings highlight the urgent need for enhanced prevention programmes targeting high-risk age groups, strengthened healthcare capacity in rapidly developing regions and evidence-based policy interventions to address the growing global burden of eating disorders.
In this work we propose a neural operator-based coloured-in-time forcing model to predict space–time characteristics of large-scale turbulent structures in channel flows. The resolvent-based method has emerged as a powerful tool to capture dominant dynamics and associated spatial structures of turbulent flows. However, the method faces the difficulty in modelling the coloured-in-time nonlinear forcing, which often leads to large predictive discrepancies in the frequency spectra of velocity fluctuations. Although the eddy viscosity has been introduced to enhance the resolvent-based method by partially accounting for the forcing colour, it is still not able to accurately capture the decay rate of the time-correlation function. Also, the uncertainty in the modelled eddy viscosity can significantly limit the predictive reliability of the method. In view of these difficulties, we propose using the neural operator based on the DeepONet architecture to model the stochastic forcing as a function of mean velocity and eddy viscosity. Specifically, the DeepONet-based model is constructed to map an arbitrary eddy-viscosity profile and corresponding mean velocity to stochastic forcing spectra based on the direct numerical simulation data at $Re_\tau =180$. Furthermore, the learned forcing model is integrated with the resolvent operator, which enables predicting the space–time flow statistics based on the eddy viscosity and mean velocity from the Reynolds-averaged Navier–Stokes (RANS) method. Our results show that the proposed forcing model can accurately predict the frequency spectra of velocity in channel flows at different characteristic scales. Moreover, the model remains robust across different RANS-provided eddy viscosities and generalises well to $Re_\tau =550$.
This study explores the marriage matching of only-child individuals and the related outcomes. Specifically, we analyze two aspects: First, we investigate the marriage patterns of only children, examining whether people choose mates in a positive or negative assortative manner regarding only-child status. We find that, along with being more likely to remain single, only children are more likely to marry another only child. Second, we measure the matching premium or penalty as the difference in partners’ socioeconomic status between only-child and non-only-child individuals, where socioeconomic status is approximated by years of schooling. Our estimates indicate that among women who marry an only-child husband, only children are penalized, as their partners’ educational attainment is 0.63 years lower. Finally, we discuss the potential sources of this penalty in light of our empirical findings.
Chapter 10 predicts the “future” of chilling effects – which today looks darker and more dystopian than ever in light of the proliferation of new forms of artificial intelligence, machine learning, and automation technologies in society. The author here introduces a new term, “superveillance,” to explain new forms of AI-driven systems of automated legal and social norm enforcement that will likely cause mass societal chilling effects at an unprecedented scale. The author also argues that chilling effects today enable this more oppressive future and proposes comprehensive law and public policy reforms and solutions to stop it.
The intelligible world of machines and predictive modelling is an omnipresent and almost inescapable phenomenon. It is an evolution where human intelligence is being supported, supplemented or superseded by artificial intelligence (AI). Decisions once made by humans are now made by machines, learning at a faster and more accurate rate through algorithmic calculations. Jurisprudent academia has undertaken to argue the proposition of AI and its role as a decision-making mechanism in Australian criminal jurisdictions. This paper explores this proposition through predictive modelling of 101 bail decisions made in three criminal courts in the State of New South Wales (NSW), Australia. Indicatively, the models’ statistical performance and accuracy, based on nine predictor variables, proved effective. The more accurate logistic regression model achieved 78% accuracy and a performance value of 0.845 (area under the curve; AUC), while the classifier model achieved 72.5% accuracy and a performance value of 0.702 (AUC). These results provide the groundwork for AI-generated bail decisions being piloted in the NSW jurisdiction and possibly others within Australia.
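The two performance measures reported above, accuracy and area under the ROC curve (AUC), can be computed as in the following sketch. The labels and scores are made up for illustration and are unrelated to the 101 bail decisions; the function names and the 0.5 decision threshold are assumptions.

```python
def accuracy_at_threshold(y_true, scores, threshold=0.5):
    """Share of cases where thresholding the score recovers the true label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, y_true)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random positive case outscores a
    random negative case, counting ties as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: 1 = positive outcome, 0 = negative outcome (illustrative only).
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]
print(accuracy_at_threshold(y_true, scores))
print(auc(y_true, scores))
```

Accuracy depends on the chosen threshold, whereas AUC summarises ranking quality across all thresholds, which is why the two figures for a model can differ.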
In this article, a 1 × 2 bandwidth (BW) and frequency-reconfigurable dielectric resonator-based multiple input multiple output (MIMO) antenna array is presented for 5G sub-6 GHz (3.3–6.0 GHz)/Wi-Fi 6E (5.925–6.425 GHz)/Wi-Fi 5G (5.15–5.85 GHz) applications. Additional dual-ring-open loop resonator structures with varied dimensions are introduced within the antenna’s feeding network to achieve BW and frequency reconfigurability. RF PIN and varactor diodes (VDs) are integrated with the proposed structure to enable switching between various modes and continuous tuning of frequency and BW, respectively. Further, a Taguchi neural network (TNN) is incorporated to predict the percentage bandwidth of the proposed antenna, with a maximum deviation of only 0.6% from the actual value. The proposed structure operates from 4.98 to 6.5 GHz, achieving wide continuous frequency tuning of 20.36% in the passband and 6.1% reconfiguration for the notch band. It also demonstrates continuous BW tunability from 16.69% to 34.44%, with measured BWs of 19.58%, 34.44%, and 16.69% at 0, 3, and 8 V reverse bias voltages of the VDs, respectively. The MIMO antenna array structure also shows enhanced gain performance, with a peak gain of 11.03 dBi and an overall gain above 7 dBi across the whole operating band.
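For readers unfamiliar with the term, percentage (fractional) bandwidth is conventionally the band width divided by the centre frequency. The sketch below assumes that conventional definition; the abstract does not state which definition or band edges the authors used, so the result is not expected to match the reported figures. The function name is hypothetical.

```python
def percent_bandwidth(f_low_ghz: float, f_high_ghz: float) -> float:
    """Fractional bandwidth in percent, relative to the arithmetic centre frequency."""
    if not 0 < f_low_ghz < f_high_ghz:
        raise ValueError("expected 0 < f_low_ghz < f_high_ghz")
    centre = (f_low_ghz + f_high_ghz) / 2
    return 100 * (f_high_ghz - f_low_ghz) / centre


# Example with round numbers: a 9-11 GHz band is 2 GHz wide around 10 GHz.
print(percent_bandwidth(9.0, 11.0))
```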
Psychotic-like experiences (PLEs) are considered a subclinical component of the psychosis continuum. Studies indicate that PLEs arise from multimodal factors, yet research comprehensively examining these factors together remains scarce. Using a large youth sample, we present the first model that simultaneously examines multimodal factors related to PLEs. As a secondary aim, we evaluate the model’s ability to explain psychosis in an external validation cohort that included individuals experiencing psychosis.
Methods
Variable selection (generalized estimating equations, correlation filtering, and a Least Absolute Shrinkage and Selection Operator model) was applied to 741 variables (i.e., environmental factors, cognitive appraisals, clinical variables, cognitive functioning, and structural brain connectome measures). The resulting PLE predictors (N = 27) and covariates (i.e., age, sex, IQ) were included in a classification model based on the Elastic Net algorithm for predicting high/low PLEs in 396 healthy participants aged 14–24 (mean age = 19.72 ± 2.5). We externally validated PLE-related predictors in a clinical sample comprising first-episode psychosis patients (n = 19), their siblings (n = 20), and healthy controls (n = 19).
Results
Eleven factors, including environmental and cognitive appraisals, along with 16 structural network properties spanning frontal, temporal, occipital, and parietal regions, were identified as important predictors of PLEs. The model’s performance was moderate in predicting low versus high PLEs (accuracy = 75%, AUC = 0.750). Specificity was high (84.2%) in distinguishing siblings from patients.
Conclusions
Multimodal features, including environmental burden, cognitive schemas, and brain network alterations, predict PLEs and partially generalize to clinical psychosis. These variables may reflect intermediate phenotypes across the psychosis spectrum, offering insights into both vulnerability and resilience.