Students will develop a practical understanding of data science with this hands-on textbook for introductory courses. This new edition is fully revised and updated, with numerous exercises and examples in the popular data science tool Python, a new chapter on using Python for statistical analysis, and a new chapter that demonstrates how to use Python within a range of cloud platforms. The many practice examples, drawn from real-life applications, range from small to big data and come to life in a new end-to-end project in Chapter 11. New 'Data Science in Practice' boxes highlight how concepts introduced work within an industry context and many chapters include new sections on AI and Generative AI. A suite of online material for instructors provides a strong supplement to the book, including lecture slides, solutions, additional assessment material and curriculum suggestions. Datasets and code are available for students online. This entry-level textbook is ideal for readers from a range of disciplines wishing to build a practical, working knowledge of data science.
Students will develop a practical understanding of data science with this hands-on textbook for introductory courses. This new edition is fully revised and updated, with numerous exercises and examples in the popular data science tool R, a new chapter on using R for statistical analysis, and a new chapter that demonstrates how to use R within a range of cloud platforms. The many practice examples, drawn from real-life applications, range from small to big data and come to life in a new end-to-end project in Chapter 11. New 'Data Science in Practice' boxes highlight how concepts introduced work within an industry context and many chapters include new sections on AI and Generative AI. A suite of online material for instructors provides a strong supplement to the book, including lecture slides, solutions, additional assessment material and curriculum suggestions. Datasets and code are available for students online. This entry-level textbook is ideal for readers from a range of disciplines wishing to build a practical, working knowledge of data science.
In this study, a classifier (hyperplane) is determined to distinguish the neural responses during emotion regulation versus viewing images in healthy adults and then applied to determine (i) the effectiveness of the emotion regulation response (defined as emotion regulation distance from the hyperplane [DFHER]) in independent samples of healthy adults, patients with BD, and the patients’ unaffected relatives (URs) and (ii) the association of DFHER with the duration of future (hypo)manic and depressive episodes for patients with BD over a 16-month follow-up period.
Methods
Study participants (N = 226) included 65 healthy adults (35 used for support vector machine [SVM] learning [HCTrain] and 30 kept as an independent test sample [HCTest]), 87 patients with newly diagnosed BD (67% BD type 2) and 74 URs. BOLD response data came from an emotion regulation task. Clinical symptoms were assessed at baseline fMRI and after 16 months of specialized treatment.
Results
The SVM ML analysis identified a hyperplane with 75.7% accuracy. Patients with BD showed reduced DFHER relative to the HCTest and UR groups. Reduced DFHER was associated with reduced improvement in psychosocial functioning during the 16-month follow-up time (B = −1.663, p = 0.02).
Conclusions
The neural response during emotion regulation can be relatively well distinguished in healthy adults via ML. Patients with newly diagnosed BD show significant disruption in the recruitment of this emotion regulation response. This disruption may indicate a reduced capacity for functional improvement during specialized treatment in a mood disorder clinic.
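The abstract does not state how DFHER is computed. For a linear classifier, the usual geometric reading of "distance from the hyperplane" is the signed distance (w·x + b)/‖w‖, and the sketch below illustrates only that formula. The weights, bias and feature vectors are hypothetical values chosen for easy arithmetic, not data from the study.

```python
import math


def signed_distance(w, b, x):
    """Signed Euclidean distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0:
        raise ValueError("weight vector must be non-zero")
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


# Hypothetical two-feature classifier (illustrative values only).
w = [3.0, 4.0]
b = -5.0

# Two hypothetical response patterns on the same side of the hyperplane.
far_point = [3.0, 4.0]   # (9 + 16 - 5) / 5 = 4.0
near_point = [1.0, 1.0]  # (3 + 4 - 5) / 5 = 0.4

far_distance = signed_distance(w, b, far_point)
near_distance = signed_distance(w, b, near_point)
```

Under this reading, a smaller positive distance corresponds to a response pattern that sits closer to the decision boundary, which is one way to interpret "reduced DFHER".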
Between 2023 and 2024, the Endangered Archaeology in the Middle East and North Africa (EAMENA) project, in collaboration with the Libyan Department of Antiquities (DoA), organised and conducted a series of training workshops and fieldwork campaigns in Libya, funded by the British Council’s Cultural Protection Fund (CPF). The workshops provided training to over 20 members of the DoA in a newly-developed Machine Learning Automated Change Detection (MLACD) tool. This remote sensing method was developed by the Leicester EAMENA team to detect landscape change and aid heritage monitoring efforts. The MLACD method was applied to four case studies in Libya: Lefakat (Cyrenaica), Bani Walid (Tripolitania), the region south of Derna (Cyrenaica) and Jarma (Fazzan). Each of these case studies was followed by a survey campaign by Libyan archaeologists to validate the results of the method, survey the archaeological sites identified, record their condition and assess the disturbances and threats affecting them. This article will provide an overview of the aims and successful outcomes of the EAMENA-CPF training programme, as well as an introduction to the MLACD method and its application to Libyan heritage, providing background and context for the individual case studies, which will be published more fully in separate articles.
Early identification of risk for attention-deficit hyperactivity disorder (ADHD) symptoms can enable more timely interventions and improve long-term outcomes. While previous research has linked various maternal and perinatal factors to ADHD, few studies have examined these predictors collectively in a single comprehensive analysis. This study aimed to assess whether later ADHD symptoms can be predicted from information available at birth, specifically ethnicity, maternal metabolic markers, mental health, and socioeconomic status. It additionally aimed to identify the most influential predictors. Using data from the Born in Bradford (BiB) study, we applied multiple linear regression (LR) and machine learning techniques to predict ADHD symptoms as measured by the Hyperactivity/Inattention subscale of the Strengths and Difficulties Questionnaire (SDQ). A 10-fold cross-validated LR model explained 6.97% of the variance in SDQ scores. In the random forest model, infant male sex and maternal smoking during pregnancy emerged as the top predictors. These findings provide proof of principle for early identification of children at risk of ADHD. Future models may benefit from incorporating additional perinatal data to improve predictive accuracy.
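As a minimal sketch of the 10-fold cross-validation mentioned above, the following splits sample indices into k contiguous folds so that each sample is held out exactly once. The function name is hypothetical, and the absence of shuffling is a simplification: the abstract does not describe how the folds were formed, and real analyses typically shuffle or stratify first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Hypothetical example: 23 samples split into 10 folds.
folds = list(k_fold_indices(23, 10))
```

A model would be fitted on each training split and scored on the matching held-out split, with the per-fold scores then averaged.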
Trustworthy volumetric flow measurements are essential in many applications such as power plant controls or district heating systems. Flow metering under disturbed flow conditions, such as downstream of bends, is a challenge and leads to errors of up to 20 %. In this paper, an algorithm based on a shallow neural network (SNN) is developed, leading to a significant error reduction for strongly disturbed flow profiles. To cover a wide range of disturbances, the training dataset was chosen to consist of three base types of elbow configurations. For 83 % of the test data, the SNN produces a smaller error than the state-of-the-art approach. The average error is reduced from 2.25 % to 0.42 %. For the SNN, an error of less than 1 % can be achieved for downstream distances greater than 10 pipe diameters. The SNN demonstrated robustness to various reductions of the training dataset, as well as to noisy input data. Additionally, simulation data of a realistic pipe system with a significantly different geometry compared with the training data was used for testing. In this strong extrapolation, the mean error of the SNN was always smaller than the state-of-the-art approach and an error of less than 1 % could be achieved for more than 10 pipe diameters downstream of the last disturbance.
Bridge the gap between theoretical concepts and their practical applications with this rigorous introduction to the mathematics underpinning data science. It covers essential topics in linear algebra, calculus and optimization, and probability and statistics, demonstrating their relevance in the context of data analysis. Key application topics include clustering, regression, classification, dimensionality reduction, network analysis, and neural networks. What sets this text apart is its focus on hands-on learning. Each chapter combines mathematical insights with practical examples, using Python to implement algorithms and solve problems. Self-assessment quizzes, warm-up exercises and theoretical problems foster both mathematical understanding and computational skills. Designed for advanced undergraduate students and beginning graduate students, this textbook serves as both an invitation to data science for mathematics majors and as a deeper excursion into mathematics for data science students.
In this study, a metasurface (MS) polarization converter combined with a two-port dielectric antenna is constructed and studied. The feeding configuration, which consists of a printed line connected to an aperture, offers built-in filtering capabilities. In addition to converting linear to circular polarization between 2.49 and 3.25 GHz, the suspended MS layer enhances port isolation to less than −20 dB. The proposed radiator’s |S11| is also predicted using the Random Forest and XGBoost machine learning (ML) models, which demonstrate satisfactory agreement with simulation data. The antenna functions effectively over 2.33–3.35 GHz, making it a strong candidate for sub-6 GHz 5G communication systems. Measurements of the fabricated antenna support both the simulation and the ML predictions.
High-resolution particle image velocimetry (PIV) particle-to-velocity analyses using small interrogation areas (IAs) often require substantial processing time. To overcome this limitation, a generative adversarial network (GAN)-based model is proposed to achieve spatio-temporal super-resolution (SR) reconstruction from low-resolution PIV data with large IAs, thereby significantly reducing post-processing time. Time-resolved PIV measurements of plasma-induced vortex flows, covering vortex formation, growth, transition and breakdown stages, are employed to train and evaluate the model with multi-scale vortical structures. By sequentially constructing spatial and temporal datasets, the GAN-based model enables reliable SR reconstruction at different scaling factors. Reconstruction accuracy is systematically assessed using time-averaged, instantaneous and phase-averaged velocity fields. At SR factors of $\times$4 and $\times$8, the reconstructed fields closely match high-resolution references, effectively capturing both fluctuating velocities and small-scale vortical structures. In contrast, $\times$16 reconstructions exhibit diminished accuracy due to the loss of fine-scale information from highly downsampled inputs. For time-averaged fields, high-resolution reconstructions reliably capture plasma jet characteristics at all SR factors. To enhance generalisation, transfer learning is introduced to fine-tune the parameters of SR-related layers in the generator, enabling accurate reconstructions under varying vortex dynamics. In addition, the efficiency gains in PIV particle-to-velocity analysis and the fundamental limitations on achievable SR factors imposed by spatio-temporal data correlations are discussed. This study demonstrates that GAN-based spatio-temporal SR models offer a promising approach to accelerate PIV analyses while maintaining high reconstruction fidelity across diverse flow conditions.
Adolescence marks a critical period for the onset of anxiety disorders, yet they frequently remain undiagnosed due to barriers such as reluctance to self-disclose symptoms. Objective screening methods that bypass self-report may improve early detection. Speech-derived acoustic markers have emerged as a promising avenue for identifying anxiety disorders. This study investigates associations between acoustic properties of speech, anxiety severity, and anxiety diagnoses in adolescents, evaluated cross-sectionally and longitudinally.
Methods
Speech samples from 581 adolescents were collected during the Trier Social Stress Test. Acoustic features were extracted using OpenSMILE and analyzed for cross-sectional associations with anxiety severity (Spearman’s correlations) and longitudinal predictions of future anxiety (linear regressions). Random forest (RF) classifiers with 10-fold cross-validation were used to classify anxious and healthy individuals using acoustic features. Analyses were stratified by sex.
Results
RFs achieved the highest performance for the longitudinal classification of social anxiety disorder (SAD), with an AUC-ROC of 85% (males) and 74% (females). Adding acoustic features to baseline measures increased the variance explained in anxiety by 5.4% (males) and 10.9% (females). In males, higher anxiety was cross-sectionally correlated with reduced pitch slope, narrower pitch range, lower F1 frequency, and greater MFCC1 variability. Females with higher anxiety showed reduced variability in pitch slope. Correlations did not survive multiple testing correction.
Conclusions
Acoustic speech markers elicited in socially evaluative contexts can accurately recognize SAD in male adolescents three years in advance. Performance is moderate for females and other anxiety disorders, underscoring the need for sex-specific approaches to diagnostic tool development.
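For readers unfamiliar with the AUC-ROC figures reported above, the sketch below computes AUC from its rank-based definition: the probability that a randomly chosen positive case receives a higher classifier score than a randomly chosen negative case, with ties counted as one half. The scores and labels are hypothetical, not the study's data, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def auc_roc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores and true labels (1 = anxious, 0 = healthy).
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
auc = auc_roc(scores, labels)  # 5 of 6 pairs ranked correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a scale for reading the reported 85% and 74%.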
Background. Despite the growing recognition of adolescent suicide as a pressing concern, traditional methods for identifying suicide risk often fail to capture the complex interplay of socio-ecological and psychological factors. The advent of machine learning (ML) offers a transformative opportunity to improve suicide risk prediction and intervention strategies. Objective. This study aims to use ML techniques to analyze socio-ecological and psychological risk factors to predict suicide ideation, plans and attempts among a nationally representative sample of Ghanaian adolescents. Methods. A cross-sectional survey was conducted with 1,703 adolescents aged 12–18 years across Ghana, measuring psychological factors (depression symptoms, anxiety symptoms, etc.) and socio-ecological factors (bullying, parental support, etc.) using validated measures. Descriptive statistics were computed, and random forest and logistic regression models were employed for suicide risk prediction, i.e., ‘ideation, plans and attempts’. Model performance was evaluated using accuracy, sensitivity, specificity and feature importance analysis. Results. Psychological factors such as depression symptoms (r = .42, p < .01), anxiety (r = .38, p < .01) and perceived stress (r = .35, p < .01) were the strongest predictors of suicide ideation, plans and attempts, while parental support emerged as a significant protective factor (r = −.34, p < .01). The random forest model demonstrated good predictive performance (accuracy = 78.3%, AUC = 0.81). Gender differences were observed. Conclusions. This study is the first to apply ML techniques to a nationally representative dataset of Ghanaian adolescents for suicide risk prediction, i.e., ‘ideation, plans and attempts’. The findings highlight the potential of ML to provide precise tools for early identification of at-risk individuals.
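The evaluation metrics named in the Methods can be illustrated with a short sketch that derives accuracy, sensitivity and specificity from binary predictions. The function name and the label vectors are hypothetical and serve only to show the definitions; they are unrelated to the study's data.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity and specificity for binary labels (1 = at risk)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of true cases detected
        "specificity": tn / (tn + fp),  # share of non-cases correctly cleared
    }


# Hypothetical labels: 3 at-risk and 5 not-at-risk adolescents.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

Because at-risk cases are usually a minority, accuracy alone can look good while sensitivity is poor, which is one reason to report all three.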
This chapter describes the important role of artificial intelligence (AI) in Big Data psychology research. First, we discuss the main goals of AI, and then delve into an example of machine learning and what is happening under the hood. The chapter then describes the Perceptron, a classic simple neural network, and how this has grown into deep learning AI which has become increasingly popular in recent years. Deep learning can be used both for prediction and generation, and has a multitude of applications for psychology and neuroscience. This chapter concludes with the ethical quandaries around fake data generated by AI and biases that exist in how we train systems, as well as some exciting clinical applications of AI relevant to psychology and neuroscience.
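As an illustration of the Perceptron mentioned above, here is a minimal sketch of the classic perceptron learning rule trained on the logical AND function. It is not taken from the chapter: the function names, learning rate and epoch limit are assumptions made for this example.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 if the weighted sum exceeds zero, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Perceptron rule: nudge weights towards each misclassified sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            update = learning_rate * (y - predict(weights, bias, x))
            if update != 0:
                weights = [w + update * xi for w, xi in zip(weights, x)]
                bias += update
                errors += 1
        if errors == 0:  # every sample classified correctly: stop early
            break
    return weights, bias


# Logical AND is linearly separable, so the rule converges.
samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 0, 1]
weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

A single unit of this kind can only learn linearly separable functions. Deep learning stacks many such units with nonlinear activations to get past that limit.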
Transonic buffet presents time-dependent aerodynamic characteristics associated with shock, turbulent boundary layer and their interactions. Despite strong nonlinearities and a large number of degrees of freedom, there exists a dominant dynamic pattern of a buffet cycle, suggesting the low dimensionality of transonic buffet phenomena. This study seeks a low-dimensional representation of transonic airfoil buffet at a high Reynolds number with machine learning. Wall-modelled large-eddy simulations of flow over the OAT15A supercritical airfoil at two Mach numbers, $M_\infty = 0.715$ and 0.730, respectively producing non-buffet and buffet conditions, at a chord-based Reynolds number of ${Re} = 3\times 10^6$ are performed to generate the present datasets. We find that the low-dimensional nature of transonic airfoil buffet can be extracted as a sole three-dimensional latent representation through lift-augmented autoencoder compression. The current low-order representation not only describes the shock movement but also captures the moment when the separation occurs near the trailing edge in a low-order manner. We further show that it is possible to perform sensor-based reconstruction through the present low-dimensional expression while identifying the sensitivity with respect to aerodynamic responses. The present model trained at ${Re} = 3\times 10^6$ is lastly evaluated at the level of a real aircraft operation of ${Re} = 3\times 10^7$, showing that the phase dynamics of lift is reasonably estimated from sparse sensors. The current study may provide a foundation towards data-driven real-time analysis of transonic buffet conditions under aircraft operation.
With ambitious action required to achieve global climate mitigation goals, climate change has become increasingly salient in the political arena. This article presents a dataset of climate change salience in 1,792 political manifestos of 620 political parties across different party families in forty-five OECD, European, and South American countries from 1990 to 2022. Importantly, our measure uniquely isolates climate change salience, avoiding the conflation with general environmental and sustainability content found in other work. Exploiting recent advances in supervised machine learning, we developed the dataset by fine-tuning a pre-trained multilingual transformer with human coding, employing a resource-efficient and replicable pipeline for multilingual text classification that can serve as a template for similar tasks. The dataset unlocks new avenues of research on the political discourse of climate change, on the role of parties in climate policy making, and on the political economy of climate change. We make the model and the dataset available to the research community.
The epidemiology and age-specific patterns of lifetime suicide attempts (LSA) in China remain unclear. We aimed to examine age-specific prevalence and predictors of LSA among Chinese adults using machine learning (ML).
Methods
We analyzed 25,047 adults in the 2024 Psychology and Behavior Investigation of Chinese Residents (PBICR-2024), stratified into three age groups (18–24, 25–44, ≥ 45 years). Thirty-seven candidate predictors across six domains—sociodemographic, physical health, mental health, lifestyle, social environment, and self-injury/suicide history—were assessed. Five ML models—random forest, logistic regression, support vector machine (SVM), Extreme Gradient Boosting (XGBoost), and Naive Bayes—were compared. SHapley Additive exPlanations (SHAP) were used to quantify feature importance.
Results
The overall prevalence of LSA was 4.57% (1,145/25,047), with significant age differences: 8.10% in young adults (18–24), 4.67% in adults aged 25–44, and 2.67% in older adults (≥45). SVM achieved the best test-set performance across all ages [area under the curve (AUC) 0.88–0.94, sensitivity 0.79–0.87, specificity 0.81–0.88], showing superior calibration and net clinical benefit. SHAP analysis identified both shared and age-specific predictors. Suicidal ideation, adverse childhood experiences, and suicide disclosure were consistent top predictors across all ages. Sleep disturbances and anxiety symptoms stood out in young adults; marital status, living alone, and perceived stress in mid-life; and functional limitations, poor sleep, and depressive symptoms in older adults.
Conclusions
LSA prevalence in Chinese adults is relatively high, with a clear age gradient peaking in young adulthood. Risk profiles revealed both shared and age-specific predictors, reflecting distinct life-stage vulnerabilities. These findings support age-tailored suicide prevention strategies in China.
The simulation of turbulent flow requires many degrees of freedom to resolve all the relevant time and length scales. However, due to the dissipative nature of the Navier–Stokes equations, the long-term dynamics is expected to lie on a finite-dimensional invariant manifold with fewer degrees of freedom. In this study, we build low-dimensional data-driven models of pressure-driven flow through a circular pipe. We impose the ‘shift-and-reflect’ symmetry to study the system in a minimal computational cell (i.e. the smallest domain size that sustains turbulence) at a Reynolds number of 2500. We build these models by using autoencoders to parametrise the manifold coordinates and neural ordinary differential equations to describe their time evolution. Direct numerical simulations (DNSs) typically require $\mathcal{O}(10^5)$ degrees of freedom, while our data-driven framework enables the construction of models with fewer than 20 degrees of freedom. Remarkably, these reduced-order models quantitatively capture crucial features of the flow, including the streak breakdown and regeneration cycle. In short-time tracking, these models accurately track the true trajectory for one Lyapunov time and recover the leading Lyapunov exponent, while at long times they successfully capture key aspects of the dynamics such as Reynolds stresses and energy balance. Additionally, we report new exact coherent states found in the DNS with the aid of these low-dimensional models. This approach leads to the discovery of seventeen previously unknown solutions within the turbulent pipe flow system, notably featuring relative periodic orbits characterised by the longest reported periods for such flow conditions.
With the growing amount of historical infrastructure data available to engineers, data-driven techniques have been increasingly employed to forecast infrastructure performance. In addition to algorithm selection, data preprocessing strategies for machine learning implementations play an equally important role in ensuring accuracy and reliability. The present study focuses on pavement infrastructure and identifies four categories of strategies to preprocess data for training machine-learning-based forecasting models. The Long-Term Pavement Performance (LTPP) dataset is employed to benchmark these categories. Employing random forest as the machine learning algorithm, the comparative study examines the impact of data preprocessing strategies, the volume of historical data, and forecast horizon on the accuracy and reliability of performance forecasts. The strengths and limitations of each implementation strategy are summarized. Multiple pavement performance indicators are also analysed to assess the generalizability of the findings. Based on the results, several findings and recommendations are provided for short- to medium-term infrastructure management and decision-making: (i) in data-scarce scenarios, strategies that incorporate both explanatory variables and historical performance data provide better accuracy and reliability, (ii) to achieve accurate forecasts, the volume of historical data should at least span a time duration comparable to the intended forecast horizon, and (iii) for International Roughness Index and transverse crack length, a forecast horizon of up to five years is generally achievable, but forecasts beyond a three-year horizon are not recommended for longitudinal crack length. These quantitative guidelines ultimately support more effective and reliable application of data-driven techniques in infrastructure performance forecasting.
This study aimed to evaluate an artificial intelligence-assisted tool for psychiatric case formulation compared with human trainees. Twenty trainees and an artificial intelligence system produced formulations for three simulated psychiatric cases. Formulations were scored using the integrated case formulation scale (ICFS), assessing content, integration and total quality. Time taken was recorded, and assessor predictions of formulation origin were analysed.
Results
Artificial intelligence produced formulations significantly faster (<10 s) than trainees (mean 52.1 min). Trainees achieved higher ICFS total scores (mean difference 8.3, P < 0.001), driven by superior content scores, while integration scores were comparable. The assessor identified artificial intelligence-generated formulations with 71.4% sensitivity, but overall accuracy in identifying who produced the formulations was only 58.3%.
Clinical implications
Artificial intelligence shows promise as a time-saving adjunct in psychiatric training and practice, but requires improvements in generating detailed content. Optimising teaching methods for trainees and refining artificial intelligence systems can enhance the integration of artificial intelligence into clinical workflows.