This study introduces an advanced reinforcement learning (RL)-based control strategy for heating, ventilation, and air conditioning (HVAC) systems, employing a soft actor-critic agent with a customized reward mechanism. This strategy integrates time-varying outdoor temperature-dependent weighting factors to dynamically balance thermal comfort and energy efficiency. Our methodology has undergone rigorous evaluation across two distinct test cases within the building optimization testing (BOPTEST) framework, an open-source virtual simulator equipped with standardized key performance indicators (KPIs) for performance assessment. Each test case is strategically selected to represent distinct building typologies, climatic conditions, and HVAC system complexities, ensuring a thorough evaluation of our method across diverse settings. The first test case is a heating-focused scenario in a residential setting. Here, we directly compare our method against four advanced control strategies: an optimized rule-based controller inherently provided by BOPTEST, two sophisticated RL-based strategies leveraging BOPTEST’s KPIs as reward references, and a model predictive control (MPC)-based approach specifically tailored for the test case. Our results indicate that our approach outperforms the rule-based and other RL-based strategies and achieves outcomes comparable to the MPC-based controller. The second scenario, a cooling-dominated environment in an office setting, further validates the versatility of our strategy under varying conditions. The consistent performance of our strategy across both scenarios underscores its potential as a robust tool for smart building management, adaptable to both residential and office environments under different climatic challenges.
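The weighting idea in the abstract above can be illustrated with a minimal sketch: a reward that blends a comfort penalty and an energy penalty, with the blend driven by the outdoor temperature. The function name, temperature thresholds, and quadratic comfort term are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def comfort_energy_reward(zone_temp, setpoint, energy_kwh, outdoor_temp,
                          t_low=-10.0, t_high=20.0):
    """Illustrative reward blending comfort and energy terms.

    The weighting factor w is a hypothetical monotone function of the
    outdoor temperature; the paper's actual weighting scheme may differ.
    """
    # Map outdoor temperature to a weight in [0, 1]: colder weather
    # emphasises comfort, milder weather emphasises energy savings.
    w = np.clip((t_high - outdoor_temp) / (t_high - t_low), 0.0, 1.0)
    comfort_penalty = (zone_temp - setpoint) ** 2   # squared tracking error
    energy_penalty = energy_kwh                      # energy used this step
    return -(w * comfort_penalty + (1.0 - w) * energy_penalty)

# Example: on a cold day the comfort term dominates the penalty.
print(comfort_energy_reward(zone_temp=19.5, setpoint=21.0,
                            energy_kwh=0.8, outdoor_temp=-5.0))
```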
Expert drivers possess the ability to execute high sideslip angle maneuvers, commonly known as drifting, during racing to navigate sharp corners and execute rapid turns. However, existing model-based controllers encounter challenges in handling the highly nonlinear dynamics associated with drifting along general paths. While reinforcement learning-based methods alleviate the reliance on explicit vehicle models, training a policy directly for autonomous drifting remains difficult due to multiple objectives. In this paper, we propose a control framework for autonomous drifting in the general case, based on curriculum reinforcement learning. The framework empowers the vehicle to follow paths with varying curvature at high speeds, while executing drifting maneuvers during sharp corners. Specifically, we consider the vehicle’s dynamics to decompose the overall task and employ curriculum learning to break down the training process into three stages of increasing complexity. Additionally, to enhance the generalization ability of the learned policies, we introduce randomization into sensor observation noise, actuator action noise, and physical parameters. The proposed framework is validated using the CARLA simulator, encompassing various vehicle types and parameters. Experimental results demonstrate the effectiveness and efficiency of our framework in achieving autonomous drifting along general paths. The code is available at https://github.com/BIT-KaiYu/drifting.
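The randomization step described above can be sketched as follows; the parameter names and ranges are hypothetical, and the actual CARLA-based training pipeline is considerably more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize_episode():
    """Sample one episode's randomized physical parameters (hypothetical ranges)."""
    return {
        "mass_scale": rng.uniform(0.9, 1.1),          # vehicle mass multiplier
        "tire_friction_scale": rng.uniform(0.8, 1.2),  # road/tire friction multiplier
        "obs_noise_std": rng.uniform(0.0, 0.02),       # sensor observation noise
        "act_noise_std": rng.uniform(0.0, 0.05),       # actuator action noise
    }

def perturb(obs, action, params):
    """Add Gaussian noise to observations and actions during training."""
    noisy_obs = obs + rng.normal(0.0, params["obs_noise_std"], size=obs.shape)
    noisy_act = np.clip(action + rng.normal(0.0, params["act_noise_std"],
                                            size=action.shape), -1.0, 1.0)
    return noisy_obs, noisy_act

params = randomize_episode()
obs, act = perturb(np.zeros(10), np.array([0.3, -0.1]), params)
```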
In this note we provide an upper bound for the difference between the value function of a distributionally robust Markov decision problem and the value function of a non-robust Markov decision problem. The ambiguity set of probability kernels of the distributionally robust Markov decision process is described by a Wasserstein ball around some reference kernel, whereas the non-robust Markov decision process behaves according to a fixed probability kernel contained in the ambiguity set. Our derived upper bound for the difference between the value functions is dimension-free and depends linearly on the radius of the Wasserstein ball.
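To make the shape of such a result concrete, a schematic statement is given below; the notation, the constant, and the assumptions behind it are illustrative and do not reproduce the note's precise theorem.

```latex
% Schematic form: V^{rob} is the distributionally robust value function over a
% Wasserstein ball of radius \varepsilon around a reference kernel, and V^{P}
% is the value function under a fixed kernel P contained in that ball.
\[
  \sup_{x \in \mathcal{X}} \bigl| V^{\mathrm{rob}}(x) - V^{P}(x) \bigr|
  \;\le\; C \, \varepsilon ,
\]
% where C is a constant depending on the discount factor and Lipschitz
% properties of the reward and kernels, but not on the dimension of the
% state space.
```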
The flexible flat cable (FFC) assembly task is a prime challenge in electronic manufacturing. Its characteristics of being prone to deformation under external force, tiny assembly tolerances, and fragility impede the application of robotic assembly in this field. To achieve reliable and stable automated robotic assembly of FFCs, an efficient assembly skill acquisition strategy is presented by combining a parallel robot skill learning algorithm with adaptive impedance control. The parallel robot skill learning algorithm is proposed to enhance the efficiency of FFC assembly skill acquisition, which reduces the risk of damaging the FFC and tackles the uncertain influence resulting from deformation during the assembly process. Moreover, FFC assembly is also a complex contact-rich manipulation task. An adaptive impedance controller is designed to implement force tracking during the assembly process without precise environment information, and its stability is analyzed based on a Lyapunov function. Experiments on FFC assembly are conducted to illustrate the efficiency of the proposed method. The experimental results demonstrate that the proposed method is robust and efficient.
One in eight children experience early life stress (ELS), which increases risk for psychopathology. ELS, particularly neglect, has been associated with reduced responsivity to reward. However, little work has investigated the computational specifics of this disrupted reward response, particularly the neural response to Reward Prediction Errors (RPE), a critical signal for successful instrumental learning, and the extent to which these responses are augmented for novel stimuli. The goal of the current study was to investigate the associations of abuse and neglect with the neural representation of RPE to novel and non-novel stimuli.
Methods
One hundred and seventy-eight participants (aged 10–18, M = 14.9, s.d. = 2.38) engaged in the Novelty task while undergoing functional magnetic resonance imaging. In this task, participants learn to choose novel or non-novel stimuli to win monetary rewards varying from $0 to $0.30 per trial. Levels of abuse and neglect were measured using the Childhood Trauma Questionnaire.
Results
Adolescents exposed to high levels of neglect showed reduced RPE-modulated blood oxygenation level dependent response within medial and lateral frontal cortices particularly when exploring novel stimuli (p < 0.05, corrected for multiple comparisons) relative to adolescents exposed to lower levels of neglect.
Conclusions
These data expand on previous work by indicating that neglect, but not abuse, is associated with impairments in neural RPE representation within medial and lateral frontal cortices. However, there was no association between neglect and behavioral impairments on the Novelty task, suggesting that these neural differences do not necessarily translate into behavioral differences within the context of the Novelty task.
This study proposes a novel hybrid learning approach for developing a visual path-following algorithm for industrial robots. The process involves three steps: data collection from a simulation environment, network training, and testing on a real robot. The actor network is trained using supervised learning for 500 epochs. A semi-trained network is obtained at the 250th epoch; this network is further trained for another 250 epochs using reinforcement learning methods within the simulation environment. Networks trained with supervised learning (500 epochs) and the proposed hybrid learning method (250 epochs each of supervised and reinforcement learning) are compared. The hybrid learning approach achieves a significantly lower average error (30.9 mm) compared with supervised learning (39.3 mm) on real-world images. Additionally, the hybrid approach exhibits faster processing times (31.7 s) compared with supervised learning (35.0 s). The proposed method is implemented on a KUKA Agilus KR6 R900 six-axis robot, demonstrating its effectiveness. Furthermore, the hybrid approach reduces the total power consumption of the robot’s motors compared with the supervised learning method. These results suggest that the hybrid learning approach offers a more effective and efficient solution for visual path following in industrial robots compared with traditional supervised learning.
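A minimal sketch of the two-phase training idea is given below, assuming a PyTorch actor network, synthetic stand-in data, and a plain REINFORCE update for the reinforcement-learning phase; the study's simulator, network architecture, and RL algorithm are not specified here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the simulation data: visual features -> path correction.
X = torch.randn(256, 32)                 # hypothetical visual features
y = torch.randn(256, 2)                  # hypothetical target corrections

actor = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

# Phase 1: supervised pre-training (the "semi-trained" network).
for epoch in range(250):
    opt.zero_grad()
    loss = nn.functional.mse_loss(actor(X), y)
    loss.backward()
    opt.step()

# Phase 2: reinforcement-learning fine-tuning with a simple policy-gradient
# step; the reward here is a placeholder for the simulator's path error.
for epoch in range(250):
    opt.zero_grad()
    mean = actor(X)
    dist = torch.distributions.Normal(mean, 0.1)
    action = dist.sample()
    reward = -((action - y) ** 2).sum(dim=1)           # hypothetical reward
    loss = -(dist.log_prob(action).sum(dim=1) * reward).mean()
    loss.backward()
    opt.step()
```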
Selective serotonin reuptake inhibitors (SSRIs) are first-line pharmacological treatments for depression and anxiety. However, little is known about how pharmacological action is related to cognitive and affective processes. Here, we examine whether specific reinforcement learning processes mediate the treatment effects of SSRIs.
Methods
The PANDA trial was a multicentre, double-blind, randomized clinical trial in UK primary care comparing the SSRI sertraline with placebo for depression and anxiety. Participants (N = 655) performed an affective Go/NoGo task three times during the trial and computational models were used to infer reinforcement learning processes.
Results
There was poor task performance: only 54% of the task runs were informative, with more informative task runs in the placebo than in the active group. There was no evidence for the preregistered hypothesis that Pavlovian inhibition was affected by sertraline. Exploratory analyses revealed that in the sertraline group, early increases in Pavlovian inhibition were associated with improvements in depression after 12 weeks. Furthermore, sertraline increased how fast participants learned from losses and faster learning from losses was associated with more severe generalized anxiety symptoms.
Conclusions
The study findings indicate a relationship between aversive reinforcement learning mechanisms and aspects of depression, anxiety, and SSRI treatment, but these relationships did not align with the initial hypotheses. Poor task performance limits the interpretability and likely generalizability of the findings, and highlights the critical importance of developing acceptable and reliable tasks for use in clinical studies.
Funding
This article presents research supported by NIHR Program Grants for Applied Research (RP-PG-0610-10048), the NIHR BRC, and UCL, with additional support from IMPRS COMP2PSYCH (JM, QH) and a Wellcome Trust grant (QH).
Developing an artificial design agent that mimics human design behaviors through the integration of heuristics is pivotal for various purposes, including advancing design automation, fostering human-AI collaboration, and enhancing design education. However, this endeavor necessitates abundant behavioral data from human designers, posing a challenge due to data scarcity for many design problems. One potential solution lies in transferring learned design knowledge from one problem domain to another. This article aims to gather empirical evidence and computationally evaluate the transferability of design knowledge represented at a high level of abstraction across different design problems. Initially, a design agent grounded in reinforcement learning (RL) is developed to emulate human design behaviors. A data-driven reward mechanism, informed by the Markov chain model, is introduced to reinforce prominent sequential design patterns. Subsequently, the design agent transfers the acquired knowledge from a source task to a target task using a problem-agnostic high-level representation. Through a case study involving two solar system designs, one dataset trains the design agent to mimic human behaviors, while another evaluates the transferability of these learned behaviors to a distinct problem. Results demonstrate that the RL-based agent outperforms a baseline model utilizing the first-order Markov chain model in both the source task without knowledge transfer and the target task with knowledge transfer. However, the model’s performance is comparatively lower in predicting the decisions of low-performing designers, suggesting caution in its application, as it may yield unsatisfactory results when mimicking such behaviors.
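The Markov-chain-informed reward described above can be sketched as follows: transition probabilities between high-level design actions are estimated from human sequences and used to score an agent's next action. The action vocabulary and the log-probability form of the reward are illustrative assumptions, not the article's exact mechanism.

```python
from collections import defaultdict
import math

# Hypothetical human design sequences over abstract, high-level actions.
sequences = [["add", "move", "evaluate", "move", "evaluate"],
             ["add", "evaluate", "move", "evaluate"]]

# Estimate first-order Markov transition counts from the data.
counts = defaultdict(lambda: defaultdict(int))
for seq in sequences:
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1

def transition_prob(prev, nxt):
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

def markov_reward(prev_action, action):
    """Reward an agent action in proportion to how typical the transition
    is in the human data (illustrative; the article's reward may differ)."""
    return math.log(transition_prob(prev_action, action) + 1e-6)

print(markov_reward("move", "evaluate"))
```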
The increase in Electrical and Electronic Equipment (EEE) usage in various sectors has given rise to repair and maintenance units. Disassembly of parts requires proper planning, which is handled by the Disassembly Sequence Planning (DSP) process; manual disassembly in particular is subject to time and labor restrictions that make such planning essential. Effective disassembly planning methods can encourage the reuse and recycling sector, resulting in a reduction of raw-material mining. An efficient DSP can lower time and cost consumption. To address the challenges in DSP, this research introduces an innovative framework based on Q-Learning (QL) within the domain of Reinforcement Learning (RL). Furthermore, an Enhanced Simulated Annealing (ESA) algorithm is introduced to improve the exploration and exploitation balance in the proposed RL framework. The proposed framework is extensively evaluated against state-of-the-art frameworks and benchmark algorithms using a diverse set of eight products as test cases. The findings reveal that the proposed framework outperforms benchmark algorithms and state-of-the-art frameworks in terms of time consumption, memory consumption, and solution optimality. Specifically, for complex large products, the proposed technique achieves a remarkable minimum reduction of 60% in time consumption and 30% in memory usage compared to other state-of-the-art techniques. Additionally, qualitative analysis demonstrates that the proposed approach generates sequences with high fitness values, indicating more stable and less time-consuming disassembly sequences. The utilization of this framework allows for the realization of various real-world disassembly applications, thereby making a significant contribution to sustainable practices in EEE industries.
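A minimal sketch of tabular Q-learning with an annealed Boltzmann exploration temperature, in the spirit of combining QL with simulated annealing, is shown below on a toy four-part product. The cost model, precedence penalty, and annealing schedule are illustrative and much simpler than the ESA algorithm described above.

```python
import math
import random

random.seed(0)

parts = ["cover", "screw", "cable", "board"]
base_cost = {"cover": 1.0, "screw": 2.0, "cable": 1.5, "board": 3.0}

def removal_cost(part, removed):
    # Hypothetical precedence constraint: removing the board before the
    # screw is penalised, standing in for real disassembly constraints.
    penalty = 2.0 if part == "board" and "screw" not in removed else 0.0
    return base_cost[part] + penalty

Q = {}                      # Q[(frozenset(removed_parts), next_part)]
alpha, gamma, T0 = 0.1, 0.95, 1.0

for episode in range(2000):
    temperature = max(T0 * math.exp(-0.005 * episode), 1e-3)  # annealing schedule
    removed = frozenset()
    while len(removed) < len(parts):
        candidates = [p for p in parts if p not in removed]
        qvals = [Q.get((removed, p), 0.0) for p in candidates]
        # Boltzmann (softmax) exploration: high temperature explores,
        # low temperature exploits; values shifted by the max for stability.
        m = max(qvals)
        prefs = [math.exp((q - m) / temperature) for q in qvals]
        action = random.choices(candidates, weights=prefs)[0]
        reward = -removal_cost(action, removed)
        nxt = removed | {action}
        best_next = max((Q.get((nxt, p), 0.0) for p in parts if p not in nxt),
                        default=0.0)
        q_old = Q.get((removed, action), 0.0)
        Q[(removed, action)] = q_old + alpha * (reward + gamma * best_next - q_old)
        removed = nxt

# Greedy disassembly sequence extracted from the learned Q-values.
removed, sequence = frozenset(), []
while len(removed) < len(parts):
    action = max((p for p in parts if p not in removed),
                 key=lambda p: Q.get((removed, p), 0.0))
    sequence.append(action)
    removed = removed | {action}
print(sequence)
```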
The use of machine learning in robotics is a vast and growing area of research. In this chapter we consider a few key directions: the use of deep neural networks, the application of reinforcement learning and especially deep reinforcement learning, and the rapidly emerging potential of large language models.
We often forego a larger future reward in order to obtain a smaller reward immediately, a tendency known as impatient intertemporal choice. The current study investigated the role of Pavlovian-to-instrumental transfer (PIT) as a mechanism contributing to impatient intertemporal choice, following a theoretical framework proposing that cues associated with immediate gratification trigger a Pavlovian approach response, interfering with goal-directed (instrumental) inhibitory behavior. We developed a paradigm in which participants first learned to make instrumental go/no-go responses in order to win rewards and avoid punishments. Next, they learned the associations between Pavlovian cues and rewards varying in amount and delay. Finally, we tested whether these (task-irrelevant) cues exerted transfer effects by influencing instrumental actions while participants again completed the go/no-go task. Across two experiments, Pavlovian cues associated with larger (versus smaller) and immediate (versus delayed) rewards were evaluated more positively, reflecting the successful acquisition of Pavlovian cue–outcome associations. These findings replicated the previously reported classical transfer effect of reward amount on instrumental behavior, as cues associated with larger (versus smaller) rewards increased instrumental approach. In contrast, we found no evidence for the hypothesized transfer effects for reward delay, contrary to the proposed role of PIT in impatient intertemporal choice. These results suggest that although both reward amount and delay were important in the evaluation of cues, only the amount associated with cues influenced instrumental choice. We provide concrete suggestions for future studies, addressing instrumental outcome identity, competition between cue–amount and cue–delay associations, and individual differences in response to Pavlovian cues.
Individuals with cocaine use disorder or gambling disorder demonstrate impairments in cognitive flexibility: the ability to adapt to changes in the environment. Flexibility is commonly assessed in a laboratory setting using probabilistic reversal learning, which involves reinforcement learning, the process by which feedback from the environment is used to adjust behavior.
Aims
It is poorly understood whether impairments in flexibility differ between individuals with cocaine use and gambling disorders, and how this is instantiated by the brain. We applied computational modelling methods to gain a deeper mechanistic explanation of the latent processes underlying cognitive flexibility across two disorders of compulsivity.
Method
We present a re-analysis of probabilistic reversal data from individuals with either gambling disorder (n = 18) or cocaine use disorder (n = 20) and control participants (n = 18), using a hierarchical Bayesian approach. Furthermore, we relate behavioural findings to their underlying neural substrates through an analysis of task-based functional magnetic resonance imaging (fMRI) data.
Results
We observed lower ‘stimulus stickiness’ in gambling disorder, and report differences in tracking expected values in individuals with gambling disorder compared to controls, with greater activity during reward expected value tracking in the cingulate gyrus and amygdala. In cocaine use disorder, we observed lower responses to positive punishment prediction errors and greater activity following negative punishment prediction errors in the superior frontal gyrus compared to controls.
Conclusions
Using a computational approach, we show that individuals with gambling disorder and cocaine use disorder differed in their perseverative tendencies and in how they tracked value neurally, which has implications for psychiatric classification.
A fleet of aircraft can be seen as a set of degrading systems that undergo variable loads as they fly missions and require maintenance throughout their lifetime. Optimal fleet management aims to maximise fleet availability while minimising overall maintenance costs. To achieve this goal, individual aircraft, with variable age and degradation paths, need to operate cooperatively to maintain high fleet availability while avoiding mechanical failure by scheduling preventive maintenance actions. In recent years, reinforcement learning (RL) has emerged as an effective method to optimise complex sequential decision-making problems. In this paper, an RL framework to optimise the operation and maintenance of a fleet of aircraft is developed. Three case studies, with varying numbers of aircraft in the fleet, are used to demonstrate the ability of the RL policies to outperform traditional operation/maintenance strategies. As more aircraft are added to the fleet, the combinatorial explosion of the number of possible actions is identified as a main computational limitation. We conclude that the RL policy has potential to support fleet management operators and call for greater research on the application of multi-agent RL for fleet availability optimisation.
Dive into the foundations of intelligent systems, machine learning, and control with this hands-on, project-based introductory textbook. Precise, clear introductions to core topics in fuzzy logic, neural networks, optimization, deep learning, and machine learning avoid the use of complex mathematical proofs and are supported by over 70 examples. Modular chapters built around a consistent learning framework enable tailored course offerings to suit different learning paths. Over 180 open-ended review questions support self-review and class discussion, over 120 end-of-chapter problems cement student understanding, and over 20 hands-on Arduino assignments connect theory to practice, supported by downloadable Matlab and Simulink code. Comprehensive appendices review the fundamentals of modern control, and contain practical information on implementing hands-on assignments using Matlab, Simulink, and Arduino. Accompanied by solutions for instructors, this is the ideal guide for senior undergraduate and graduate engineering students, and professional engineers, looking for an engaging and practical introduction to the field.
In this chapter, we introduce some of the more popular ML algorithms. Our objective is to provide the basic concepts and main ideas, show how to utilize these algorithms in Matlab, and offer some examples. In particular, we discuss essential concepts in feature engineering and how to apply them in Matlab. Support vector machines (SVM), K-nearest neighbor (KNN), linear regression, the Naïve Bayes algorithm, and decision trees are introduced, and the fundamental underlying mathematics is explained while Matlab’s corresponding Apps are used to implement each of these algorithms. A special section on reinforcement learning is included, detailing the key concepts and basic mechanism of this third ML category. Finally, we showcase how to implement reinforcement learning in Matlab, make use of some of the Python libraries available online, and show how to use reinforcement learning for controller design.
The growing need for agricultural products and the challenges posed by environmental and economic factors have created a demand for enhanced agricultural systems management. Machine learning has increasingly been leveraged to tackle agricultural optimization problems, and in particular, reinforcement learning (RL), a subfield of machine learning, seems a promising tool for data-driven discovery of future farm management policies. In this work, we present the development of CropGym, a Gymnasium environment, where a reinforcement learning agent can learn crop management policies using a variety of process-based crop growth models. As a use case, we report on the discovery of strategies for nitrogen application in winter wheat. An RL agent is trained to decide weekly on applying a discrete amount of nitrogen fertilizer, with the aim of achieving a balance between maximizing yield and minimizing environmental impact. Results show that close to optimal strategies are learned, competitive with standard practices set by domain experts. In addition, we evaluate, as an out-of-distribution test, whether the obtained policies are resilient against a change in climate conditions. We find that, when rainfall is sufficient, the RL agent remains close to the optimal policy. With CropGym, we aim to facilitate collaboration between the RL and agronomy communities to address the challenges of future agricultural decision-making.
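Since CropGym follows the Gymnasium interface, an agent interacts with it through the standard reset/step loop sketched below. The environment id used here is a built-in stand-in (CartPole-v1), because CropGym's actual registration name and observation/action spaces are not assumed; a trained RL agent would replace the random action.

```python
import gymnasium as gym

# Stand-in environment; a CropGym environment would be created the same way
# once registered with Gymnasium.
env = gym.make("CartPole-v1")

obs, info = env.reset(seed=42)
total_reward = 0.0
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()   # e.g. a weekly fertilizer decision
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
env.close()
print(f"episode return: {total_reward}")
```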
Learning from rewarded and punished choices is perturbed in depressed patients, suggesting that abnormal reinforcement learning may be a cognitive mechanism of the illness. However, previous studies have disagreed about whether this behavior is produced by alterations in the rate of learning or sensitivity to experienced outcomes. This previous work has generally assessed learning in response to binary outcomes of one valence, rather than to both rewarding and punishing continuous outcomes.
Methods
A novel drifting reward and punishment magnitude reinforcement-learning task was administered to patients with current (n = 40) and remitted depression (n = 39), and healthy volunteers (n = 40) to capture potential differences in learning behavior. Standard questionnaires were administered to measure self-reported depressive symptom severity, trait and state anxiety and level of anhedonic symptoms.
Results
Our findings demonstrate that patients with current depression adjust their learning behaviors to a lesser degree in response to trial-by-trial variations in reward and loss magnitudes than the other groups. Computational modeling revealed that this behavioral signature of current depressive state is better accounted for by reduced reward and punishment sensitivity (all p < 0.031), rather than a change in learning rate (p = 0.708). However, between-group differences were not related to self-reported symptom severity or comorbid anxiety disorders in the current depression group.
Conclusion
These findings suggest that current depression is associated with reduced outcome sensitivity rather than altered learning rate. Previous findings reported in this domain mainly from binary learning tasks seem to generalize to learning from continuous outcomes.
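The distinction drawn above between outcome sensitivity and learning rate can be written, in a standard (not study-specific) formulation, as:

```latex
% Q-learning update with outcome sensitivity \rho and learning rate \alpha:
\[
  Q_{t+1}(a_t) \;=\; Q_t(a_t) + \alpha \bigl( \rho \, r_t - Q_t(a_t) \bigr)
\]
% A reduced \rho compresses the effective reward and punishment magnitudes
% even when \alpha is unchanged, which matches the pattern reported above
% for patients with current depression.
```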
In this paper, a super-twisting disturbance observer (STDO)-based adaptive reinforcement learning control scheme is proposed for the straight air compound missile system with aerodynamic uncertainties and unmodeled dynamics. Firstly, a neural network (NN)-based adaptive reinforcement learning control scheme with an actor-critic design is investigated to deal with the tracking problem for the straight air compound system. The actor NN and the critic NN are utilised to cope with the unmodeled dynamics and to approximate the cost function related to the control input and tracking error, respectively. In other words, the actor NN is used to perform the tracking control behaviours, and the critic NN aims to evaluate the tracking performance and give feedback to the actor NN. Moreover, with the aid of the STDO disturbance observer, the problem of control signal fluctuation caused by mismatched disturbances can be solved well. Based on the proposed adaptive law and the Lyapunov direct method, the uniform ultimate boundedness of the straight air compound system is proved. Finally, numerical simulations are carried out to demonstrate the feasibility and superiority of the proposed reinforcement learning-based STDO control algorithm.
We describe three sampling models that aim to cast light on how some design features of social media platforms systematically affect judgments of their users. We specify the micro-mechanisms of belief formation and interactions and explore their macro implications such as opinion polarization. Each model focuses on a specific aspect of platform-mediated social interactions: how popularity creates additional exposure to contrarian arguments; how differences in popularity make an agent more likely to hear particularly persuasive arguments in support of popular options; and how opinions in favor of popular options are reinforced through social feedback. We show that these mechanisms lead to self-reinforcing dynamics that can result in local opinion homogenization and between-group polarization. Unlike nonsampling-based approaches, our focus does not lie in peculiarities of information processing such as motivated cognition but instead emphasizes how structural features of the learning environment contribute to opinion homogenization and polarization.
Cognitive distancing is an emotion regulation strategy commonly used in psychological treatment of various mental health disorders, but its therapeutic mechanisms are unknown.
Methods
935 participants completed an online reinforcement learning task involving choices between pairs of symbols with differing reward contingencies. Half (49.1%) of the sample was randomised to a cognitive self-distancing intervention and were trained to regulate or ‘take a step back’ from their emotional response to feedback throughout. Established computational (Q-learning) models were then fit to individuals' choices to derive reinforcement learning parameters capturing clarity of choice values (inverse temperature) and their sensitivity to positive and negative feedback (learning rates).
Results
Cognitive distancing improved task performance, including when participants were later tested on novel combinations of symbols without feedback. Group differences in computational model-derived parameters revealed that cognitive distancing resulted in clearer representations of option values (estimated 0.17 higher inverse temperatures). Simultaneously, distancing caused increased sensitivity to negative feedback (estimated 19% higher loss learning rates). Exploratory analyses suggested this resulted from an evolving shift in strategy by distanced participants: initially, choices were more determined by expected value differences between symbols, but as the task progressed, they became more sensitive to negative feedback, with evidence for a difference strongest by the end of training.
Conclusions
Adaptive effects on the computations that underlie learning from reward and loss may explain the therapeutic benefits of cognitive distancing. Over time and with practice, cognitive distancing may improve symptoms of mental health disorders by promoting more effective engagement with negative information.
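As background for the computational modelling described in this last abstract, a minimal Q-learning simulation with a softmax inverse temperature and separate learning rates for positive and negative feedback is sketched below; it is a standard formulation with arbitrary parameter values, not the study's fitted model.

```python
import numpy as np

def simulate_q_learner(outcomes, alpha_gain=0.3, alpha_loss=0.3, beta=3.0, seed=0):
    """Two-option Q-learning agent with a softmax choice rule (inverse
    temperature beta) and separate learning rates for positive and negative
    prediction errors; a standard formulation, not the study's exact code."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2)
    choices = []
    for r_pair in outcomes:                  # r_pair: feedback for each option
        p = np.exp(beta * q) / np.exp(beta * q).sum()
        a = rng.choice(2, p=p)
        delta = r_pair[a] - q[a]             # prediction error
        alpha = alpha_gain if delta >= 0 else alpha_loss
        q[a] += alpha * delta
        choices.append(a)
    return np.array(choices)

# Example: option 1 pays off more often than option 0.
outcomes = np.where(np.random.default_rng(1).random((200, 2)) <
                    np.array([0.3, 0.7]), 1.0, -1.0)
choices = simulate_q_learner(outcomes)
print(choices[-50:].mean())   # fraction of late choices of the better option
```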