Introduction
Artificial intelligence (AI) is defined as the ability of a system to interpret external data, learn from it, and accomplish specific goals through adaptation (Haenlein & Kaplan, Reference Haenlein and Kaplan2019). AI, particularly machine learning, has shown promise in surpassing human capabilities in various tasks such as medical image analysis, clinical documentation, and patient monitoring (Bohr & Memarzadeh, Reference Bohr and Memarzadeh2020; Davenport & Kalakota, Reference Davenport and Kalakota2019). Machine learning is a technique that uses advanced statistical and probabilistic methods to build systems that improve through experience, enabling the prediction and categorization of data, particularly in mental health research (Chung & Teo, Reference Chung and Teo2022). Traditional machine learning is commonly used in precision medicine to predict successful treatments based on patient attributes and treatment context (Davenport & Kalakota, Reference Davenport and Kalakota2019). Neural networks are advanced machine learning algorithms designed to mimic the function of the human brain, enabling them to solve complex problems such as image and speech recognition (Chung & Teo, Reference Chung and Teo2022). Neural networks are employed to categorize patients and determine the likelihood of developing specific diseases (Davenport & Kalakota, Reference Davenport and Kalakota2019). Deep learning is a subset of machine learning that utilizes multi-layered neural networks to automatically learn and solve complex problems, including image recognition, speech recognition, and natural language processing (Chung & Teo, Reference Chung and Teo2022). Deep learning utilizes multiple layers of features to predict outcomes, such as disease prognosis and patient mortality in cancer cases (Lu et al., Reference Lu, Xu, Nguyen, Geng, Pfob and Sidey-Gibbons2022; Zhang et al., Reference Zhang, Nikouline, Lightfoot and Nolan2022). Another application of deep learning is natural language processing, which aims to understand human language through speech recognition, text analysis, and translation (Locke et al., Reference Locke, Bashall, Al-Adely, Moore, Wilson and Kitchen2021; Nassif et al., Reference Nassif, Shahin, Attili, Azzeh and Shaalan2019), assisting in tasks such as creating, analyzing, and classifying clinical documentation, transcribing patient interactions, and conducting conversations (Buchlak et al., Reference Buchlak, Esmaili, Farrokhi and Bennett2022; Casey et al., Reference Casey, Davidson, Poon, Dong, Duma, Grivas and Alex2021; Davenport & Kalakota, Reference Davenport and Kalakota2019; Kreimeyer et al., Reference Kreimeyer, Foster, Pandey, Arya, Halford, Jones and Botsis2017).
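To make these definitions concrete, the following minimal sketch illustrates the supervised-learning workflow underlying the models discussed in this review; the data, feature counts, and network sizes are synthetic and purely illustrative, not drawn from any cited study.

```python
# A minimal sketch of supervised machine learning with a small neural network,
# using scikit-learn. All data here is synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for tabular clinical features and a binary diagnosis label.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small feed-forward neural network: each hidden layer learns a new
# representation ("layer of features") of the input, as described above.
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
clf.fit(X_train, y_train)          # "improving through experience"
print(clf.score(X_test, y_test))   # accuracy on held-out cases
```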
AI in the field of mental health has witnessed significant growth (Cecula et al., Reference Cecula, Yu, Dawoodbhoy, Delaney, Tan, Peacock and Cox2021) with studies exploring its potential in the early detection of mental illnesses, treatment planning (Ćosić et al., Reference Ćosić, Popović, Šarlija, Kesedžić and Jovanovic2020; Johnson et al., Reference Johnson, Wei, Weeraratne, Frisse, Misulis, Rhee and Snowdon2021), speech signal analysis in therapy sessions (Goldberg et al., Reference Goldberg, Flemotomos, Martinez, Tanana, Kuo, Pace and Atkins2020), and continuous patient monitoring (Bohr & Memarzadeh, Reference Bohr and Memarzadeh2020). Given the rising global demand for accurate diagnosis, improved monitoring, and effective interventions in mental health, AI holds promise as a powerful tool. The demand for mental health diagnosis and treatment further intensified during the COVID-19 pandemic, with a notable increase in depressive symptoms, anxiety, and distress worldwide (Davenport et al., Reference Davenport, Cheng, Iorfino, Hamilton, Castaldi, Burton and Hickie2020; Latoo et al., Reference Latoo, Haddad, Mistry, Wadoo, Islam, Jan and Alabdulla2021; Moreno et al., Reference Moreno, Wykes, Galderisi, Nordentoft, Crossley, Jones and Arango2020). To address the substantial increase in global demand for mental health resources, the use of AI tools has emerged as a potential solution. By leveraging AI, various applications can be developed to support and enhance mental health services. AI-assisted diagnosis tools can enable early detection and treatment (Ćosić et al., Reference Ćosić, Popović, Šarlija, Kesedžić and Jovanovic2020; Johnson et al., Reference Johnson, Wei, Weeraratne, Frisse, Misulis, Rhee and Snowdon2021). AI-powered monitoring can facilitate continuous and remote mental health assessments, reducing the need for patients to travel to healthcare facilities (Graham et al., Reference Graham, Depp, Lee, Nebeker, Tu, Kim and Jeste2019). AI-based interventions have the potential to address this demand by offering scalable and adaptable solutions to different populations (Bickman, Reference Bickman2020; Koutsouleris et al., Reference Koutsouleris, Hauser, Skvortsova and De Choudhury2022).
To advance AI technology in the field of mental health and overcome its current limitations, it is crucial to have a comprehensive understanding of how AI can be applied throughout the patient journey. The need for a comprehensive review of the application of AI in mental health research and clinical practice is underscored by the growing reliance on technology to address pressing mental health challenges. As AI systems become increasingly proficient in interpreting data and producing actionable insights, they present an opportunity to enhance traditional approaches to mental health diagnostics, monitoring, and interventions. The increasing demand for mental health services, exacerbated by the COVID-19 pandemic, emphasizes the importance of leveraging AI to facilitate early detection of mental illnesses, optimize treatment planning, and provide continuous patient support. By systematically evaluating the existing literature, this review will elucidate how AI can transform mental health care, potentially leading to more accurate diagnoses, personalized treatment plans, and efficient resource allocation, thereby contributing significantly to the overall understanding of AI’s role in strengthening mental health systems worldwide.
AI in mental health is hampered by difficulties in obtaining high-quality, representative data, along with data security concerns, a lack of training resources, and fragmented data formats (Koutsouleris et al., Reference Koutsouleris, Hauser, Skvortsova and De Choudhury2022). Additionally, the belief that clinical judgment outweighs quantitative measures slows advancements in digital health care and AI applications (Koutsouleris et al., Reference Koutsouleris, Hauser, Skvortsova and De Choudhury2022). This systematic review aims to analyze the current status of AI in mental health care, focusing specifically on its application in the areas of diagnosis, monitoring, and intervention, and to identify the limitations, challenges, and ethical considerations associated with the use of AI technologies. Focusing this systematic review on three critical domains – diagnosis, monitoring, and intervention – allows for a targeted analysis of the multifaceted ways in which AI can enhance mental health care. In the diagnosis domain, exploring AI’s role can reveal its potential for early identification of mental health conditions, improving patient outcomes through timely intervention. In terms of monitoring, AI technologies can enable ongoing assessments that are essential for tracking patient progress and adapting treatment plans effectively. Finally, examining AI-assisted interventions showcases how scalable digital solutions can address the growing demand for accessible mental health resources. By dissecting these three domains, the review will not only highlight the strengths and limitations of AI applications but also address ethical considerations, ultimately guiding future research and innovation in mental health technology. This systematic review was prepared to answer the following questions:
1. How is AI used in diagnosing mental health illnesses, monitoring disease progression and treatment effectiveness, and conducting AI-assisted mental health interventions?
2. What are the limitations, challenges, and ethical concerns in the application of AI technologies in mental health?
Methods
Search strategy
The systematic review was conducted following PRISMA guidelines, and it was registered on PROSPERO (registration number: CRD42023388503). The literature search was conducted using the Cumulative Index to Nursing and Allied Health Literature (CINAHL), Cochrane Central Register of Controlled Trials (CCRT), PubMed, PsycINFO, and Scopus databases from inception to August 2024. The search terms and search strategy used to retrieve relevant research studies are shown in Table 1. Filters were applied to retrieve research studies, including “Trials” for CCRT; “Full text,” “English language,” and “Randomized controlled trials” for CINAHL; “Clinical trial” and “English” for PsycINFO; “Full text,” “Clinical trial,” “Randomized controlled trial,” and “English” for PubMed; and “Article,” “Journal,” “English,” and “Doc title, abstract, keyword” for Scopus.
Table 1. Search terms and database search strategy
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20250205021803133-0085:S0033291724003295:S0033291724003295_tab1.png?pub-status=live)
Inclusion and exclusion criteria for study selection
Inclusion and exclusion criteria for study selection were set at three levels: general, domain-specific, and outcome measures.
General
Studies using AI-assisted diagnosis tools, AI-monitored treatment effectiveness and prognosis, or AI-based interventions in the context of mental health were included. Studies that did not include mental health outcomes, or that primarily targeted disorders such as dementia, attention-deficit/hyperactivity disorder, autism spectrum disorders, or drug abuse, were excluded. Systematic reviews, meta-analyses, classical reviews, protocols, book chapters, conference presentations, and studies not written in English were also excluded.
Domain-specific
Diagnosis
Studies that applied AI in detecting the presence of mental health disorders, predicting the risk of having mental health disorders, and identifying features that are associated with mental health disorders were included. Studies that classified subgroups of mental illnesses were excluded as diagnoses had already been made.
Monitoring
Studies were included if they adopted AI either to collect data for monitoring and predicting the ongoing prognosis of a mental health disorder or to monitor treatment effects. Studies that used AI to predict treatment-related mental health improvement or the risk of symptom remission prior to treatment initiation were excluded, as this review aimed to focus on monitoring treatment effectiveness and treatment-related mental health prognosis in clinical practice.
Intervention
Studies that applied any form of AI-assisted intervention were included. Studies that did not use AI-assisted interventions, or that used AI only in other aspects of the research, such as data analysis and outcome prediction, were excluded.
Outcome measures
The findings were presented in a systematic and narrative form, including the AI approaches used in mental health, the domain of mental health care in which AI was applied, the presence and severity of mental health disorders or symptoms, and the accuracy or effectiveness of the AI-based tool. The application, limitations, challenges, and ethical concerns of AI in mental health were also critically discussed.
Selection of relevant studies
Two authors independently conducted the database search and the selection of studies according to the inclusion and exclusion criteria. After the article search and the removal of duplicates, the titles and abstracts of the retrieved research studies were screened. Full texts were then reviewed in a second round of screening. After selecting the studies, the authors reviewed the final list of included studies. Discrepancies were resolved by a third author.
Data extraction
Three authors were involved in the data extraction (one author per domain), and an additional author reviewed the extracted data and resolved any discrepancies. The data extracted included the AI approaches used in mental health, the mental health care domain in which AI was applied, the AI tool, sample size, and effectiveness, as well as the limitations, challenges, and ethical considerations of AI in mental health. Study investigators were contacted regarding any missing data.
Quality assessment
The National Heart, Lung, and Blood Institute’s (NHLBI) quality assessment tools were used to examine the quality of the studies included. The studies encompassed various types, including controlled intervention studies, observational cohort and cross-sectional studies, case-control studies, and before-after (pre–post) studies without a control group, with assessments conducted using different NHLBI tools. The number of items assessed for each study type was 14, 14, 12, and 12, respectively. For the scoring method, each item was categorized as “yes,” “no,” or “other” (e.g., “cannot determine,” “not applicable,” or “not reported”). The overall quality score, categorized as “good,” “fair,” or “poor,” was not simply a cumulative total, but rather a qualitative assessment derived from response patterns. Reviewers considered essential factors that could influence the validity of the study (https://www.nhlbi.nih.gov/health-topics/study-quality-assessment-tools). Two independent authors performed the quality appraisal, with a third author helping to resolve any disagreements.
Results
A total of 842 research studies were retrieved from the five databases (CINAHL, CCRT, PubMed, PsycINFO, and Scopus). After screening and removing duplicates, a total of 32 studies were included in diagnosis, 39 in monitoring, 13 in intervention, and one in both diagnosis and monitoring (Figure 1).
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20250205021803133-0085:S0033291724003295:S0033291724003295_fig1.png?pub-status=live)
Figure 1. PRISMA flowchart of study identification, screening, and selection.
Diagnosis
Thirty-two studies in the diagnosis domain trained and developed machine learning algorithms to detect and predict mental health conditions (Table 2). The target population (n = 327,625) involved individuals who had developed or were susceptible to developing mental health conditions. The most common algorithms included the support vector machine, a supervised learning algorithm for classification and regression tasks that finds the optimal hyperplane to maximize the margin between different classes (Adler et al., Reference Adler, Wang, Mohr and Choudhury2022; Byun et al., Reference Byun, Kim, Jang, Kim, Choi, Yu and Jeon2019; Chilla et al., Reference Chilla, Yeow, Chew, Sim and Prakash2022; Das & Naskar, Reference Das and Naskar2024; Ebdrup et al., Reference Ebdrup, Axelsen, Bak, Fagerlund, Oranje, Raghava and Glenthøj2019; Geng et al., Reference Geng, An, Fu, Wang and An2023; Marquand, Mourão-Miranda, Brammer, Cleare, & Fu, Reference Marquand, Mourão-Miranda, Brammer, Cleare and Fu2008; Matsuo et al., Reference Matsuo, Ushida, Emoto, Moriyama, Iitani, Nakamura and Kotani2022; Mohamed et al., Reference Mohamed, Naqishbandi, Bukhari, Rauf, Sawrikar and Hussain2023; Mongan et al., Reference Mongan, Föcking, Healy, Susai, Heurich and Wynne2021; Pestian et al., Reference Pestian, Grupp-Phelan, Bretonnel Cohen, Meyers, Richey, Matykiewicz and Sorter2016; Schnack et al., Reference Schnack, Nieuwenhuis, Van Haren, Abramovic, Scheewe, Brouwer and Kahn2014; Setoyama et al., Reference Setoyama, Kato, Hashimoto, Kunugi, Hattori, Hayakawa and Kanba2016; Susai et al., Reference Susai, Mongan, Healy, Cannon, Cagney, Wynne and Cotter2022; Tate et al., Reference Tate, McCabe, Larsson, Lundström, Lichtenstein and Kuja-Halkola2020) and the random forest, an ensemble learning method that improves predictive accuracy by aggregating the outputs of multiple decision trees, reducing overfitting while enhancing model robustness (Andersson et al., Reference Andersson, Bathula, Iliadis, Walter and Skalkidou2021; Chen et al., Reference Chen, Chan, Li, Chan, Liu, Li and Wing2024; Chilla et al., Reference Chilla, Yeow, Chew, Sim and Prakash2022; Ebdrup et al., Reference Ebdrup, Axelsen, Bak, Fagerlund, Oranje, Raghava and Glenthøj2019; Hüfner et al., Reference Hüfner, Tymoszuk, Ausserhofer, Sahanic, Pizzini, Rass and Sperner-Unterweger2022; Jacobson et al., Reference Jacobson, Yom-Tov, Lekkas, Heinz, Liu and Barr2022; Kourou et al., Reference Kourou, Manikis, Mylona, Poikonen-Saksela, Mazzocco, Pat-Horenczyk and Fotiadis2023; Lønfeldt et al., Reference Lønfeldt, Olesen, Das, Mora-Jensen, Pagsberg and Clemmensen2023; Manikis et al., Reference Manikis, Simos, Kourou, Kondylakis, Poikonen-Saksela, Mazzocco and Fotiadis2023; Matsuo et al., Reference Matsuo, Ushida, Emoto, Moriyama, Iitani, Nakamura and Kotani2022; Mohamed et al., Reference Mohamed, Naqishbandi, Bukhari, Rauf, Sawrikar and Hussain2023; Setoyama et al., Reference Setoyama, Kato, Hashimoto, Kunugi, Hattori, Hayakawa and Kanba2016; Tate et al., Reference Tate, McCabe, Larsson, Lundström, Lichtenstein and Kuja-Halkola2020).
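As an illustration of how these two families of classifiers are typically fitted and compared, the following sketch uses scikit-learn on synthetic data; all parameters are hypothetical, and no included study is reproduced here.

```python
# An illustrative sketch of the two algorithms named above: a support vector
# machine, which fits a maximum-margin hyperplane, and a random forest, which
# aggregates the votes of many decision trees (synthetic data only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

svm = SVC(kernel="linear", C=1.0)                  # hyperplane with maximal margin
forest = RandomForestClassifier(n_estimators=200,  # ensemble of 200 trees;
                                random_state=0)    # averaging reduces overfitting

for name, model in [("SVM", svm), ("Random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)    # 5-fold cross-validation
    print(name, scores.mean().round(3))
```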
Machine learning was used to diagnose specific mental disorders such as depression (Byun et al., Reference Byun, Kim, Jang, Kim, Choi, Yu and Jeon2019; Carrillo et al., Reference Carrillo, Sigman, Fernández Slezak, Ashton, Fitzgerald, Stroud and Carhart-Harris2018; Chen et al., Reference Chen, Chan, Li, Chan, Liu, Li and Wing2024; Das & Naskar, Reference Das and Naskar2024; Du et al., Reference Du, Liu, Balamurugan and Selvaraj2021; Geng et al., Reference Geng, An, Fu, Wang and An2023; Hüfner et al., Reference Hüfner, Tymoszuk, Ausserhofer, Sahanic, Pizzini, Rass and Sperner-Unterweger2022; Maglanoc et al., Reference Maglanoc, Kaufmann, Jonassen, Hilland, Beck, Landrø and Westlye2020; Marquand et al., Reference Marquand, Mourão-Miranda, Brammer, Cleare and Fu2008; Setoyama et al., Reference Setoyama, Kato, Hashimoto, Kunugi, Hattori, Hayakawa and Kanba2016; Xu et al., Reference Xu, Thompson, Kerr, Godbole, Sears, Patterson and Natarajan2018) and schizophrenia (Chilla et al., Reference Chilla, Yeow, Chew, Sim and Prakash2022; Ebdrup et al., Reference Ebdrup, Axelsen, Bak, Fagerlund, Oranje, Raghava and Glenthøj2019; Liang et al., Reference Liang, Vega, Kong, Deng, Wang, Ma and Li2018; Schnack et al., Reference Schnack, Nieuwenhuis, Van Haren, Abramovic, Scheewe, Brouwer and Kahn2014), and to identify suicide risk (Cook et al., Reference Cook, Progovac, Chen, Mullin, Hou and Baca-Garcia2016; Jacobson et al., Reference Jacobson, Yom-Tov, Lekkas, Heinz, Liu and Barr2022; Jankowsky et al., Reference Jankowsky, Krakau, Schroeders, Zwerenz and Beutel2024; Lyu & Zhang, Reference Lyu and Zhang2019; Pestian et al., Reference Pestian, Grupp-Phelan, Bretonnel Cohen, Meyers, Richey, Matykiewicz and Sorter2016; Simon et al., Reference Simon, Shortreed, Johnson, Rossom, Lynch, Ziebell and Penfold2019; Tsui et al., Reference Tsui, Shi, Ruiz, Ryan, Biernesser, Iyengar and Brent2021; Yang et al., Reference Yang, Chung, Rhee, Park, Kim, Lee and Ahn2024). It was less frequently applied in the diagnosis of anxiety (Hüfner et al., Reference Hüfner, Tymoszuk, Ausserhofer, Sahanic, Pizzini, Rass and Sperner-Unterweger2022; Maglanoc et al., Reference Maglanoc, Kaufmann, Jonassen, Hilland, Beck, Landrø and Westlye2020), bipolar disorder (Schnack et al., Reference Schnack, Nieuwenhuis, Van Haren, Abramovic, Scheewe, Brouwer and Kahn2014), obsessive-compulsive disorder (Lønfeldt et al., Reference Lønfeldt, Olesen, Das, Mora-Jensen, Pagsberg and Clemmensen2023), and postpartum depression (Andersson et al., Reference Andersson, Bathula, Iliadis, Walter and Skalkidou2021; Matsuo et al., Reference Matsuo, Ushida, Emoto, Moriyama, Iitani, Nakamura and Kotani2022). Machine learning was also used to detect mental health symptoms, such as depressive, anxiety, and schizophrenia symptoms (Adler et al., Reference Adler, Wang, Mohr and Choudhury2022; Kourou et al., Reference Kourou, Manikis, Mylona, Poikonen-Saksela, Mazzocco, Pat-Horenczyk and Fotiadis2023; Maekawa et al., Reference Maekawa, Grua, Nakamura, Scazufca, Araya, Peters and Van De Ven2024; Manikis et al., Reference Manikis, Simos, Kourou, Kondylakis, Poikonen-Saksela, Mazzocco and Fotiadis2023; Mongan et al., Reference Mongan, Föcking, Healy, Susai, Heurich and Wynne2021; Susai et al., Reference Susai, Mongan, Healy, Cannon, Cagney, Wynne and Cotter2022; Tate et al., Reference Tate, McCabe, Larsson, Lundström, Lichtenstein and Kuja-Halkola2020), as well as outcomes associated with quality of life (Hüfner et al., Reference Hüfner, Tymoszuk, Ausserhofer, Sahanic, Pizzini, Rass and Sperner-Unterweger2022; Manikis et al., Reference Manikis, Simos, Kourou, Kondylakis, Poikonen-Saksela, Mazzocco and Fotiadis2023; Xu et al., Reference Xu, Thompson, Kerr, Godbole, Sears, Patterson and Natarajan2018). The predictors commonly used to detect and predict mental health conditions included demographic information, socioeconomic data, clinical history, physiological data, psychometric data, medical scan biomarkers, and semantic content (Table 2). Demographic information, socioeconomic data, and clinical history were retrieved from electronic health records. Examples of medical scans and physiological signals used as input to the AI models were MRI scans (Chilla et al., Reference Chilla, Yeow, Chew, Sim and Prakash2022; Ebdrup et al., Reference Ebdrup, Axelsen, Bak, Fagerlund, Oranje, Raghava and Glenthøj2019; Marquand et al., Reference Marquand, Mourão-Miranda, Brammer, Cleare and Fu2008), EEG signals (Du et al., Reference Du, Liu, Balamurugan and Selvaraj2021), and heart rate variability (HRV) signals (Geng et al., Reference Geng, An, Fu, Wang and An2023), whereas biomarkers included aqueous metabolites in blood plasma (Setoyama et al., Reference Setoyama, Kato, Hashimoto, Kunugi, Hattori, Hayakawa and Kanba2016), gray matter density (Maglanoc et al., Reference Maglanoc, Kaufmann, Jonassen, Hilland, Beck, Landrø and Westlye2020; Schnack et al., Reference Schnack, Nieuwenhuis, Van Haren, Abramovic, Scheewe, Brouwer and Kahn2014), and proteomics data from plasma samples (Mongan et al., Reference Mongan, Föcking, Healy, Susai, Heurich and Wynne2021).
Table 2. Studies on AI-assisted diagnosis in mental health
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20250205021803133-0085:S0033291724003295:S0033291724003295_tab2.png?pub-status=live)
Abbreviations: AUDIT: Alcohol Use Disorder Identification Test; ALSPAC: Avon Longitudinal Study of Parents and Children; AMT: Autobiographical memory test; AUROC and AUC: Area under the receiver operating characteristic curve; BACS: Brief Assessment of Cognition in Schizophrenia; BAI: Beck Anxiety Inventory; BIS-11: Barratt Impulsiveness Scale-11; BP: Background, medical history, and pregnancy/delivery variables; CAARMS: Comprehensive Assessment of At-Risk Mental State; CANTAB: Cambridge Neuropsychological Test Automated Battery; CNN: Convolutional Neural Network; CPTB: Copenhagen Psychophysiology Test Battery; C-SSRS: Columbia Suicide Severity Rating Scale; DART: Danish Adult Reading Test; DRF: Distributed Random Forests; DSM-IV: Diagnostic and Statistical Manual of Mental Disorder-IV; DT: Decision tree; DTI: Diffusion tensor imaging; EEG: electroencephalogram; EHR: Electronic Health Record; EMA: Ecological momentary assessment; EPDS: Edinburgh Postnatal Depression Scale; ERTC: Extremely randomized trees classifier; ETI-SF: Early Trauma Inventory-Short Form; EU-GEI: European Network of National Schizophrenia Networks Studying Gene–Environment Interactions Multimodal diagnostic accuracy; EXGB: Ensemble of extreme gradient boosting; fMRI: Functional magnetic resonance imaging; GBRT: Gradient Boosting Regression Trees; GHQ: General Health Questionnaire; HADS: Hospital Anxiety and Depression Scale; HAMD and HRSD: Hamilton Rating Scale for Depression; ICD: International Classification of Diseases; Koko: An online peer-to-peer crowdsourcing platform that teaches users cognitive reappraisal strategies that they use to help other users manage negative emotions; LASSO: Least absolute shrinkage and selection operator; LSTM: Long Short-Term Memory; MDD: Major Depressive Disorder; M.I.N.I.: Mini-International Neuropsychiatric Interview; MRI: Magnetic resonance imaging; p: p-value; NEURAPRO: A clinical trial conducted between March 2010 and the end of September 2014, tested the potential preventive role of omega-3 fatty acids in clinical high-risk participants; NLP: Natural Language Processing; NPV: Negative predictive value; PHQ: Patient Health Questionnaire; PPD: Postpartum depression; PPV: Positive Predictive Value; QoL: Quality of life; QOLm: Mental quality of life; RBC: Rank-biserial correlation; RF: Random Forest; RMSE: Root mean square error; RS: Resilience-14; SOC: Sense of Coherence-29; VPSQ: Vulnerable Personality Scale Questionnaire; SE: Regression coefficients; SF-36: 36-Item Short Form Survey; SIQ: Suicidal Ideation Questionnaire; sMRI: Structural magnetic resonance imaging; SOFAS: Social and Occupational Functional Assessment Score; SQ for KNHANES-SF: Stress Questionnaire for Korean National Health and Nutrition Examination Survey-Short Form; SVC and SVM: Support Vector Machine; SVM-RFE: Support Vector Machine Recursive Feature Elimination; UQ: Ubiquitous Questionnaire; WAIS III: Wechsler Adult Intelligence Scale® – Third Edition; W: Wilcoxon signed-rank test (one-sided) statistics; XRT: Extreme randomized forest.
Monitoring
The monitoring domain comprised a total of 40 articles, including the one article that fell under both the diagnosis and monitoring categories, with a cumulative participant count of 168,077. Table 3 shows that most studies (22/40) focused on monitoring depression, with major depressive disorder being the mental health condition most frequently monitored (15/40). The remaining studies monitored multiple psychiatric disorders such as anxiety (Jacobson et al., Reference Jacobson, Yom-Tov, Lekkas, Heinz, Liu and Barr2022; Zainal & Newman, Reference Zainal and Newman2024), personality disorders (Jacobson et al., Reference Jacobson, Yom-Tov, Lekkas, Heinz, Liu and Barr2022), schizophrenia (Brandt et al., Reference Brandt, Ritter, Schneider-Thoma, Siafis, Montag, Ayrilmaz and Stuke2023; Dong et al., Reference Dong, Rokicki, Dwyer, Papiol, Streit, Rietschel and Koutsouleris2024; Jacobson et al., Reference Jacobson, Yom-Tov, Lekkas, Heinz, Liu and Barr2022), bipolar disorder (Busk et al., Reference Busk, Faurholt-Jepsen, Frost, Bardram, Vedel Kessing and Winther2020; Lee et al., Reference Lee, Mansur, Brietzke, Kapogiannis, Delgado-Peraza, Boutilier and McIntyre2021), multiple specific phobias (Hilbert et al., Reference Hilbert, Böhnlein, Meinke, Chavanne, Langhammer, Stumpe and Lueken2024), substance use disorder (Carreiro et al., Reference Carreiro, Ramanand, Taylor, Leach, Stapp, Sherestha and Indic2024), and the comorbidity of depression and anxiety (Webb et al., Reference Webb, Hirshberg, Davidson and Goldberg2022) (Table 3). Furthermore, there was one study on psychosis (Amminger et al., Reference Amminger, Mechelli, Rice, Kim, Klier, McNamara and Schäfer2015), one on pediatric obsessive-compulsive disorder (Lenhard et al., Reference Lenhard, Sauer, Andersson, Månsson, Mataix-Cols, Rück and Serlachius2018), and four on suicide (Barrigon et al., Reference Barrigon, Romero-Medrano, Moreno-Muñoz, Porras-Segovia, Lopez-Castroman, Courtet and Baca-Garcia2023; Choo et al., Reference Choo, Wall, Brodsky, Herzog, Mann, Stanley and Galfalvy2024; Rozek et al., Reference Rozek, Andres, Smith, Leifker, Arne, Jennings and Rudd2020; Solomonov et al., Reference Solomonov, Lee, Banerjee, Flückiger, Kanellopoulos, Gunning and Alexopoulos2021). In terms of research objectives, most studies focused on predicting treatment effectiveness or response (25/40) (Table 3). The studies monitored or predicted the effectiveness of pharmacological interventions, such as long-chain omega-3 fatty acids (Amminger et al., Reference Amminger, Mechelli, Rice, Kim, Klier, McNamara and Schäfer2015), citalopram (Chekroud et al., Reference Chekroud, Zotti, Shehzad, Gueorguieva, Johnson, Trivedi and Corlett2016), and duloxetine (Maciukiewicz et al., Reference Maciukiewicz, Marshe, Hauschild, Foster, Rotzinger, Kennedy and Geraci2018), using AI. One article addressed both the prediction of treatment effectiveness and the prognosis of mental health disorders during treatment (Chekroud et al., Reference Chekroud, Zotti, Shehzad, Gueorguieva, Johnson, Trivedi and Corlett2016).
Some studies focused on predicting the effectiveness of psychotherapy, such as cognitive behavioral therapy (CBT; Lenhard et al., Reference Lenhard, Sauer, Andersson, Månsson, Mataix-Cols, Rück and Serlachius2018) and response to repetitive transcranial magnetic stimulation (Bailey et al., Reference Bailey, Hoy, Rogasch, Thomson, McQueen, Elliot and Fitzgerald2018; Dong et al., Reference Dong, Rokicki, Dwyer, Papiol, Streit, Rietschel and Koutsouleris2024). Some biomarkers, as well as sociodemographic (Chekroud et al., Reference Chekroud, Zotti, Shehzad, Gueorguieva, Johnson, Trivedi and Corlett2016; Kautzky et al., Reference Kautzky, Dold, Bartova, Spies, Vanicek, Souery and Kasper2018; Vitinius et al., Reference Vitinius, Escherich, Deter, Hellmich, Jünger, Petrowski and Albus2019), somatic (Vitinius et al., Reference Vitinius, Escherich, Deter, Hellmich, Jünger, Petrowski and Albus2019), and emotional (Dougherty et al., Reference Dougherty, Clarke, Atli, Kuc, Schlosser, Dunlop and Ryslik2023) data, predicted treatment outcomes or were used to generate predictive models of treatment-resistant depression (Kautzky et al., Reference Kautzky, Dold, Bartova, Spies, Vanicek, Souery and Kasper2018). The effectiveness of prediction with online screening tools was also evaluated (Athreya et al., Reference Athreya, Brückl, Binder, John Rush, Biernacka, Frye and Bobo2021). All articles in the monitoring domain used machine learning models, with the most commonly used models including random forest (Bao et al., Reference Bao, Zhao, Li, Zhang, Wu, Ning and Yang2021; Brandt et al., Reference Brandt, Ritter, Schneider-Thoma, Siafis, Montag, Ayrilmaz and Stuke2023; Hammelrath et al., Reference Hammelrath, Hilbert, Heinrich, Zagorscak and Knaevelsrud2024; Hilbert et al., Reference Hilbert, Böhnlein, Meinke, Chavanne, Langhammer, Stumpe and Lueken2024; Jacobson et al., Reference Jacobson, Yom-Tov, Lekkas, Heinz, Liu and Barr2022; Kautzky et al., Reference Kautzky, Dold, Bartova, Spies, Vanicek, Souery and Kasper2018; Lenhard et al., Reference Lenhard, Sauer, Andersson, Månsson, Mataix-Cols, Rück and Serlachius2018; Nie et al., Reference Nie, Vairavan, Narayan, Ye and Li2018; Solomonov et al., Reference Solomonov, Lee, Banerjee, Flückiger, Kanellopoulos, Gunning and Alexopoulos2021; Van Bronswijk et al., Reference Van Bronswijk, DeRubeis, Lemmens, Peeters, Keefe, Cohen and Huibers2021; Wang, Wu, et al., Reference Wang, Wu, DeLorenzo and Yang2024; Zainal & Newman, Reference Zainal and Newman2024; Zou et al., Reference Zou, Zhang, Xiao, Bai, Li, Liang and Wang2023), support vector machine (Bailey et al., Reference Bailey, Hoy, Rogasch, Thomson, McQueen, Elliot and Fitzgerald2018; Bao et al., Reference Bao, Zhao, Li, Zhang, Wu, Ning and Yang2021; Browning et al., Reference Browning, Kingslake, Dourish, Goodwin, Harmer and Dawson2019; Carreiro et al., Reference Carreiro, Ramanand, Taylor, Leach, Stapp, Sherestha and Indic2024; Furukawa et al., Reference Furukawa, Debray, Akechi, Yamada, Kato, Seo and Efthimiou2020; Guilloux et al., Reference Guilloux, Bassi, Ding, Walsh, Turecki, Tseng and Sibille2015; Lenhard et al., Reference Lenhard, Sauer, Andersson, Månsson, Mataix-Cols, Rück and Serlachius2018; Maciukiewicz et al., Reference Maciukiewicz, Marshe, Hauschild, Foster, Rotzinger, Kennedy and Geraci2018; Wang, Wu, et al., Reference Wang, Wu, DeLorenzo and Yang2024; Weintraub et al., Reference Weintraub, Posta, Ichinose, Arevian and Miklowitz2023; Zainal & Newman, Reference Zainal and Newman2024; Zou et al., 
Reference Zou, Zhang, Xiao, Bai, Li, Liang and Wang2023), and elastic net regularization, a statistical technique that combines the penalties of Least Absolute Shrinkage and Selection Operator (LASSO) and ridge regression to effectively handle multicollinearity and perform variable selection in high-dimensional datasets (Wu et al., Reference Wu, Ren, Chen, Xu, Xu, Zeng and Liu2022).
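For readers unfamiliar with elastic net regularization, a minimal sketch of its combined L1/L2 penalty is given below; the data are synthetic and deliberately high-dimensional (more features than samples) to mimic the multicollinear settings the method targets, and the alpha and l1_ratio values are illustrative only.

```python
# A minimal sketch of elastic net regularization: the penalty blends the
# LASSO (L1) and ridge (L2) terms, with the mixing controlled by l1_ratio.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic high-dimensional data: 200 correlated features, only 10 informative.
X, y = make_regression(n_samples=80, n_features=200, n_informative=10,
                       noise=5.0, random_state=0)

# Objective: ||y - Xw||^2 / (2n) + alpha * (l1_ratio * ||w||_1
#                                           + 0.5 * (1 - l1_ratio) * ||w||_2^2)
model = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=5000).fit(X, y)
print("non-zero coefficients:", np.count_nonzero(model.coef_))  # sparse selection
```

Setting l1_ratio to 1 recovers the LASSO and setting it to 0 recovers ridge regression, which is why the method is described as a combination of the two penalties.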
Table 3. Studies on AI-assisted monitoring in mental health
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20250205021803133-0085:S0033291724003295:S0033291724003295_tab3.png?pub-status=live)
Abbreviations: ADHD: Attention-Deficit/Hyperactivity Disorder; AIMS: Abnormal Involuntary Movement Scale; ALA: α-linolenic acid; ANN: Artificial neural network; AUC: Area under the receiver operating characteristic curve; BARS: Barnes Akathisia Rating Scale; BDI: Beck Depression Inventory; BMI: Body mass index; BSSI-W: Beck Scale for Suicide Ideation, Worst Point; CAD: Coronary artery disease; CART: Classification and regression trees; CBT: Cognitive behavioral therapy; CDSS: Sum of Calgary Depression Scale for Schizophrenia; CGI: Clinical Global Impression; COSTA: Cognitive Style Assessment measuring cognitive distortions; CRS: Cumulative Illness Rating Scale; CRT: Classification and regression tree; DT: Decision tree; EBI: Emotional Breakthrough Index; ECAT: Emotional categorization task; EEG: Electroencephalographic; EMA: Ecological Momentary Assessment; EN: Elastic net; ENRR: Elastic net regularized regression; EREC: Emotional recall task; FERT: Face-based emotional recognition task; FFMQ: Five Factor Mindfulness Questionnaire; FLX: Fluoxetine; fMRI: Functional magnetic resonance imaging; GAD: Generalized anxiety disorder; GAF: Global Assessment of Functioning; GBDT: Gradient-boosted decision trees; GRU: Gated Recurrent Unit; GSRD: Group for the Study of Resistant Depression; HADS: Hospital Anxiety and Depression Scale; HAMD: Hamilton Depression Rating Scale; HDRS: Hamilton Depression Rating Scale; HRSD: Hamilton Rating Scale for Depression; kNN: K-nearest neighbor; LASSO: Least absolute shrinkage and selection operator; LR: Logistic regression; LSTM: Long Short-Term Memory; LTE-Q: List of Threatening Experiences Questionnaire; MADRS: Montgomery–Åsberg Depression Rating Scale; MAPE: Mean absolute percent error; MDD: Major depressive disorder; MEM: Mixed-effects linear regression models; MHX: Medication history; NLP: Natural language processing; NPRS: Numerical pain rating scale; ODI: Oswestry Disability Index; OCD: Obsessive-compulsive disorder; PAI: Personalized Advantage Index; PANSS: Positive and Negative Syndrome Scale; PDSQ: Psychiatric Diagnostic Screening Questionnaire; PHQ-9: Patient Health Questionnaire-9; PHX: Psychiatric history; PRISE: Patient Rated Inventory of Side Effect; PROMIS: Patient-Reported Outcomes Information System; PRS: Polygenic risk score; PSEQ: Pain Self-Efficacy Questionnaire; PSP: Personal and Social Performance; PSRs: Psychiatric Status Ratings; QIDS-C16: Quick Inventory of Depressive Symptomatology (Clinician-Rated); QIDS-SR16: Quick Inventory of Depressive Symptomatology (Self-assessment); QoL: Quality of life; RCT: randomized controlled trial; RF: Random Forest; rTMS: Repetitive transcranial magnetic stimulation; RMSE: Root mean squared error; RNN: Recurrent neural networks; SCAN: Schedules for Clinical Assessment in Neuropsychiatry; SCS: Suicide Cognitions Scale; SEWIP: Scale for the Multiperspective Assessment of General Change Mechanisms in Psychotherapy; SHAPS: Snaith Hamilton Pleasure Scale; SICD: Structured clinical interview for DSM-IV; sMRI: Structural Magnetic Resonance Imaging; SNPs: Single nucleotide polymorphism; SNRIs: Serotonin-norepinephrine reuptake inhibitors; SPE: Subjective Prognostic Employment Scale; SSRIs: Selective serotonin reuptake inhibitors; SVM: Support vector machine; TCAs: Tricyclic antidepressants; TNF: Tumor necrosis factor; TRD: Treatment-resistant depression; XGBoost: Extreme gradient boosting; YMRS: Young Mania Rating Scale; ω-3 PUFA: Omega-3 polyunsaturated fatty acids.
The most commonly used predictors included depression severity measures using different validated tools such as the Hamilton Depression Rating Scale (HDRS) (Athreya et al., Reference Athreya, Brückl, Binder, John Rush, Biernacka, Frye and Bobo2021; Busk et al., Reference Busk, Faurholt-Jepsen, Frost, Bardram, Vedel Kessing and Winther2020; Choo et al., Reference Choo, Wall, Brodsky, Herzog, Mann, Stanley and Galfalvy2024; Harrer et al., Reference Harrer, Ebert, Kuper, Paganini, Schlicker, Terhorst and Baumeister2023; Wang, Wu, et al., Reference Wang, Wu, DeLorenzo and Yang2024), Montgomery–Åsberg Depression Rating Scale (MADRS) (Dong et al., Reference Dong, Rokicki, Dwyer, Papiol, Streit, Rietschel and Koutsouleris2024; Iniesta et al., Reference Iniesta, Malki, Maier, Rietschel, Mors, Hauser and Uher2016; Kautzky et al., Reference Kautzky, Dold, Bartova, Spies, Vanicek, Souery and Kasper2018; Lee et al., Reference Lee, Mansur, Brietzke, Kapogiannis, Delgado-Peraza, Boutilier and McIntyre2021; Ricka et al., Reference Ricka, Pellegrin, Fompeyrine, Lahutte and Geoffroy2023), Beck Depression Inventory (BDI) (Choo et al., Reference Choo, Wall, Brodsky, Herzog, Mann, Stanley and Galfalvy2024; Furukawa et al., Reference Furukawa, Debray, Akechi, Yamada, Kato, Seo and Efthimiou2020; Hammelrath et al., Reference Hammelrath, Hilbert, Heinrich, Zagorscak and Knaevelsrud2024; Iniesta et al., Reference Iniesta, Malki, Maier, Rietschel, Mors, Hauser and Uher2016; Van Bronswijk et al., Reference Van Bronswijk, DeRubeis, Lemmens, Peeters, Keefe, Cohen and Huibers2021), Patient Health Questionnaire-9 (PHQ-9) (Furukawa et al., Reference Furukawa, Debray, Akechi, Yamada, Kato, Seo and Efthimiou2020; Hammelrath et al., Reference Hammelrath, Hilbert, Heinrich, Zagorscak and Knaevelsrud2024; Harrer et al., Reference Harrer, Ebert, Kuper, Paganini, Schlicker, Terhorst and Baumeister2023; Scodari et al., Reference Scodari, Chacko, Matsumura and Jacobson2023), Hospital Anxiety and Depression Scale (HADS) (Scodari et al., Reference Scodari, Chacko, Matsumura and Jacobson2023), and Quick Inventory of Depressive Symptomatology, 16-item self-report version (QIDS-SR16) (Browning et al., Reference Browning, Kingslake, Dourish, Goodwin, Harmer and Dawson2019; Nie et al., Reference Nie, Vairavan, Narayan, Ye and Li2018; Wang, Wu, et al., Reference Wang, Wu, DeLorenzo and Yang2024). Demographic variables such as age, sex, body mass index (BMI), occupation, marital status, education level, and race were frequent predictors used in AI models (20/40) (Table 3). Medical history and comorbidities, including previous treatment experience, medication history, Axis II or III comorbidities, and concurrent physical illnesses, were also considered (15/40) (Table 3). Psychosocial factors such as stressful life events, socioeconomic status, and quality of social, work, and family life were predictors used in several studies (13/40) (Table 3).
General activity data, such as physical activity, sleep and step data, and phone usage (Barrigon et al., Reference Barrigon, Romero-Medrano, Moreno-Muñoz, Porras-Segovia, Lopez-Castroman, Courtet and Baca-Garcia2023; Ricka et al., Reference Ricka, Pellegrin, Fompeyrine, Lahutte and Geoffroy2023; Scodari et al., Reference Scodari, Chacko, Matsumura and Jacobson2023; Zou et al., Reference Zou, Zhang, Xiao, Bai, Li, Liang and Wang2023), and physiological data, including heart rate variability, acoustic variables, heart rate, and breathing rate, were also utilized as predictors (Carreiro et al., Reference Carreiro, Ramanand, Taylor, Leach, Stapp, Sherestha and Indic2024; Ricka et al., Reference Ricka, Pellegrin, Fompeyrine, Lahutte and Geoffroy2023; Wang, Wu, et al., Reference Wang, Wu, DeLorenzo and Yang2024). Genetic factors, including single nucleotide polymorphisms (SNPs) and gene expression profiles, were examined in some studies (Guilloux et al., Reference Guilloux, Bassi, Ding, Walsh, Turecki, Tseng and Sibille2015; Maciukiewicz et al., Reference Maciukiewicz, Marshe, Hauschild, Foster, Rotzinger, Kennedy and Geraci2018). Cognitive and neurobiological markers, such as electroencephalographic (EEG) measures, functional magnetic resonance imaging (fMRI) data, cognitive performance measures, and speech features, were utilized to assess cognitive functioning, neurobiological alterations, and affective processes (Bailey et al., Reference Bailey, Hoy, Rogasch, Thomson, McQueen, Elliot and Fitzgerald2018; Browning et al., Reference Browning, Kingslake, Dourish, Goodwin, Harmer and Dawson2019; Busk et al., Reference Busk, Faurholt-Jepsen, Frost, Bardram, Vedel Kessing and Winther2020; Dong et al., Reference Dong, Rokicki, Dwyer, Papiol, Streit, Rietschel and Koutsouleris2024; Dougherty et al., Reference Dougherty, Clarke, Atli, Kuc, Schlosser, Dunlop and Ryslik2023; Foster et al., Reference Foster, Mohler-Kuo, Tay, Hothorn and Seibold2019; Hilbert et al., Reference Hilbert, Böhnlein, Meinke, Chavanne, Langhammer, Stumpe and Lueken2024; Iniesta et al., Reference Iniesta, Malki, Maier, Rietschel, Mors, Hauser and Uher2016; Nguyen et al., Reference Nguyen, Chin Fatt, Treacher, Mellema, Cooper, Jha and Montillo2022; Solomonov et al., Reference Solomonov, Lee, Banerjee, Flückiger, Kanellopoulos, Gunning and Alexopoulos2021; Van Bronswijk et al., Reference Van Bronswijk, DeRubeis, Lemmens, Peeters, Keefe, Cohen and Huibers2021; Weintraub et al., Reference Weintraub, Posta, Ichinose, Arevian and Miklowitz2023). Finally, treatment-related variables, such as intervention assignment, treatment group, previous treatment response, and adherence to pharmacotherapy, were also included as predictors (Table 3).
Intervention
Thirteen studies were included in the intervention domain, 10 of which used an AI chatbot (Danieli et al., Reference Danieli, Ciulli, Mousavi, Silvestri, Barbato, Di Natale and Riccardi2022; Dimeff et al., Reference Dimeff, Jobes, Koerner, Kako, Jerome, Kelley-Brimer and Schak2021; Fulmer et al., Reference Fulmer, Joerin, Gentile, Lakerink and Rauws2018; Karkosz et al., Reference Karkosz, Szymański, Sanna and Michałowski2024; Kleinau et al., Reference Kleinau, Lamba, Jaskiewicz, Gorentz, Hungerbuehler, Rahimi and Kapps2024; Klos et al., Reference Klos, Escoredo, Joerin, Lemos, Rauws and Bunge2021; Ogawa et al., Reference Ogawa, Oyama, Morito, Kobayashi, Yamada, Shinkawa and Hattori2022; Sabour et al., Reference Sabour, Zhang, Xiao, Zhang, Zheng, Wen and Huang2023; Schillings et al., Reference Schillings, Meißner, Erb, Bendig, Schultchen and Pollatos2023; Suharwardy et al., Reference Suharwardy, Ramachandran, Leonard, Gunaseelan, Lyell, Darcy and Judy2023). The remaining three studies involved an AI-based application for medication reminders and drug identification (Chen et al., Reference Chen, Hsu, Lin, Chen, Hsieh and Ko2023), an AI platform aiding therapists in clinical decision-making and task automation (Sadeh-Sharvit et al., Reference Sadeh-Sharvit, Camp, Horton, Hefner, Berry, Grossman and Hollon2023), and an AI robotic puppy for interactive patient engagement (Yamada et al., Reference Yamada, Akahane, Takeuchi, Miyata, Sato and Gotoh2024) (Table 4).
Table 4. Studies on AI-assisted interventions in mental health
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20250205021803133-0085:S0033291724003295:S0033291724003295_tab4.png?pub-status=live)
Abbreviations: AI: Artificial Intelligence; BARS: Brief Agitation Rating Scale; BDI-II: Beck Depression Inventory-II; CBT: Cognitive Behavioral Therapy; CESD-R: Center for Epidemiologic Studies Depression Scale Revised; CgA: Chromogranin A; CSDD: Cornell Scale for Symptoms of Depression in Dementia; EPDS: Edinburgh Postnatal Depression Scale; GAD-7: General Anxiety Disorders-7 scales; OLBI: Oldenburg Burnout Inventory; OSI: Occupational Stress Indicator; PANAS: Positive and Negative Affect Scale; PANSS: Positive and Negative Syndrome Scale; PD: Parkinson’s Disease; PHQ-8: Patient Health Questionnaire-8; PHQ-9: Patient Health Questionnaire-9; PSS: Perceived Stress Scale; PSWQ: Penn State Worry Questionnaire; QIDS: Quick Inventory of Depressive Symptomatology Self-Report; SCL-90-R: Symptom Checklist-90-Revised; SIDQ: Safety and Imminent Distress Questionnaire; SRCS: Suicide-Related Coping Scale; STAI: State–Trait Anxiety Inventory; SWLS: Satisfaction With Life Scale; TEO: Therapy Empowerment Opportunity; WHO-5: 5-item WHO Well-Being Index.
The studies compared the treatment effectiveness of AI-assisted interventions against traditional interventions (Danieli et al., Reference Danieli, Ciulli, Mousavi, Silvestri, Barbato, Di Natale and Riccardi2022; Dimeff et al., Reference Dimeff, Jobes, Koerner, Kako, Jerome, Kelley-Brimer and Schak2021; Ogawa et al., Reference Ogawa, Oyama, Morito, Kobayashi, Yamada, Shinkawa and Hattori2022; Sadeh-Sharvit et al., Reference Sadeh-Sharvit, Camp, Horton, Hefner, Berry, Grossman and Hollon2023; Schillings et al., Reference Schillings, Meißner, Erb, Bendig, Schultchen and Pollatos2023; Suharwardy et al., Reference Suharwardy, Ramachandran, Leonard, Gunaseelan, Lyell, Darcy and Judy2023; Yamada et al., Reference Yamada, Akahane, Takeuchi, Miyata, Sato and Gotoh2024) or psychoeducation (Fulmer et al., Reference Fulmer, Joerin, Gentile, Lakerink and Rauws2018; Karkosz et al., Reference Karkosz, Szymański, Sanna and Michałowski2024; Kleinau et al., Reference Kleinau, Lamba, Jaskiewicz, Gorentz, Hungerbuehler, Rahimi and Kapps2024; Klos et al., Reference Klos, Escoredo, Joerin, Lemos, Rauws and Bunge2021), while other studies compared against mixed treatment or no treatment (Chen et al., Reference Chen, Hsu, Lin, Chen, Hsieh and Ko2023; Danieli et al., Reference Danieli, Ciulli, Mousavi, Silvestri, Barbato, Di Natale and Riccardi2022; Sabour et al., Reference Sabour, Zhang, Xiao, Zhang, Zheng, Wen and Huang2023). Subjects were adults with depressive, anxiety, schizophrenia, stress, and/or suicidal symptoms, with or without an established diagnosis, with a total sample size of 2,816. The most prevalent mental health conditions treated with AI-assisted interventions were depression and anxiety. The PHQ-8 or PHQ-9 and the GAD-7 were common outcome measures evaluated in AI-assisted intervention studies (Table 4).
Quality assessment
The quality of the studies was assessed using the NHLBI assessment tools. Fifty studies were rated as good, 34 as fair, and one as poor (Table 5). The diagnosis domain comprised one controlled intervention study, 15 observational cohort and cross-sectional studies, and 16 case-control studies, with 18 rated as good, 13 as fair, and one as poor. The one article falling under both the diagnosis and monitoring domains was classified as an observational cohort and cross-sectional study and assessed as fair. Regarding the intervention domain, all 13 studies were controlled intervention studies, with five rated as good and eight as fair (Table 5).
Table 5. Result of the individual components of the quality assessment of the included studies
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20250205021803133-0085:S0033291724003295:S0033291724003295_tab5.png?pub-status=live)
Abbreviations: Y: yes; N: no; NA: not applicable; NR: not reported.
Discussion
Among the 85 articles included, 58.8% were rated as good and 40% as fair. Within the monitoring domain, 69% of the articles were rated as good, while in the diagnosis domain, 56% were rated as good. In the intervention domain, 38% of the articles were rated as good. In controlled intervention studies, the main factors impacting the quality of the articles included the absence of reported adherence and drop-out rates, as well as insufficient description and implementation of concealment and blinding methods, particularly within the intervention domain. For observational cohort and cross-sectional studies, the main factors impacting quality were the lack of reporting or insufficient information regarding the participation rate, loss to follow-up, blinding, sample size justification, and adjustment for key potential confounding variables. In case-control studies, quality was primarily affected by the absence of reporting or insufficient information on sample size justification, random selection of study participants, and blinding of exposure assessors. In pre–post studies with no control group, quality was significantly influenced by the lack of information on blinded outcome assessors and loss to follow-up, and by the failure to use an interrupted time-series design. These issues partly stem from the fact that some AI models are trained on existing datasets, which are not always original data and sometimes combine multiple datasets for training, making such studies difficult to fit into standard evaluation frameworks. The overall quality of the studies is good, with 58.8% rated positively, which strengthens the review’s conclusions. However, deficiencies in reporting and methodology, especially in intervention studies where only 38% were rated as good, warrant caution in interpreting the results due to potential biases and limitations.
Studies of machine learning within the diagnosis domain demonstrated varying performance in detecting, classifying, and predicting the risk of having a mental health condition. Twenty-eight studies reported accuracy in classifying or predicting mental health conditions, ranging from 51% to 97.54% (Table 2). Machine learning models based on a single predictor, such as heart rate variability features (Byun et al., Reference Byun, Kim, Jang, Kim, Choi, Yu and Jeon2019), EEG signals (Du et al., Reference Du, Liu, Balamurugan and Selvaraj2021), MRI data (Marquand et al., Reference Marquand, Mourão-Miranda, Brammer, Cleare and Fu2008), audio spectrograms (Das & Naskar, Reference Das and Naskar2024), or gray matter density (Schnack et al., Reference Schnack, Nieuwenhuis, Van Haren, Abramovic, Scheewe, Brouwer and Kahn2014), already accomplished satisfactory performance, with reported accuracies ranging from 68% to 97.54%. Surprisingly, increasing the number of predictors did not increase the predictive power or classification performance of the machine learning model concerned (Andersson et al., Reference Andersson, Bathula, Iliadis, Walter and Skalkidou2021; Ebdrup et al., Reference Ebdrup, Axelsen, Bak, Fagerlund, Oranje, Raghava and Glenthøj2019; Yang et al., Reference Yang, Chung, Rhee, Park, Kim, Lee and Ahn2024). Designing and selecting different models and variables for prediction can lead to varying outcomes when applied to the same population with different baselines (Manikis et al., Reference Manikis, Simos, Kourou, Kondylakis, Poikonen-Saksela, Mazzocco and Fotiadis2023). Yang et al. (Reference Yang, Chung, Rhee, Park, Kim, Lee and Ahn2024) discovered that notable differences were evident when considering 10 to 15 variables across various variable transformation methods. They found that using more than 15 variables in the model did not significantly improve accuracy. Furthermore, as the number of included variables increases, the practical complexity also rises. Given these findings, the significance of targeted variable selection is underscored and warrants further exploration. In general, machine learning demonstrated satisfactory to good performance (accuracy above 75%) in detecting, classifying, and predicting the risk of having a mental health condition.
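The relationship between the number of input variables and model accuracy described above can be probed empirically; the sketch below evaluates cross-validated accuracy while retaining only the k most informative features. The data are synthetic and the numbers do not reproduce the results of Yang et al. (Reference Yang, Chung, Rhee, Park, Kim, Lee and Ahn2024).

```python
# A sketch of how the effect of variable count on accuracy can be probed:
# cross-validated accuracy as a function of the number of top-k features kept.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data: 50 candidate predictors, only 12 of which carry signal.
X, y = make_classification(n_samples=400, n_features=50, n_informative=12,
                           random_state=0)

for k in (5, 10, 15, 25, 50):
    clf = make_pipeline(SelectKBest(f_classif, k=k),   # keep the k best variables
                        LogisticRegression(max_iter=1000))
    acc = cross_val_score(clf, X, y, cv=5).mean()      # selection refit per fold
    print(f"k={k:2d} features -> accuracy {acc:.3f}")
```

Placing the feature selector inside the pipeline ensures it is refit within each cross-validation fold, avoiding the optimistic bias that arises when variables are selected on the full dataset first.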
The support vector machine is a machine learning model often used for diagnosing mental health conditions, employing a linear decision boundary, or ‘hyperplane’, to effectively separate classes in a dataset (Mohamed et al., Reference Mohamed, Naqishbandi, Bukhari, Rauf, Sawrikar and Hussain2023). It has shown high accuracy in diagnosing anxiety (95%) and depression (95.8%) while achieving lower accuracy for bipolar disorder (69%) and PTSD (69%) among war veterans (Chung & Teo, Reference Chung and Teo2022). Compared to the random forest, which yields slightly lower accuracy rates (e.g., 78.6% for depression and anxiety), the support vector machine is preferred for its ability to classify both linear and nonlinear data through kernel functions, despite its sensitivity to kernel choice and performance challenges with large or noisy datasets (Chung & Teo, Reference Chung and Teo2022; Mohamed et al., Reference Mohamed, Naqishbandi, Bukhari, Rauf, Sawrikar and Hussain2023). The random forest, a supervised learning technique that combines multiple decision trees via bagging, offers improved accuracy and reduces overfitting but comes with increased computational complexity and limited interpretability (Mohamed et al., Reference Mohamed, Naqishbandi, Bukhari, Rauf, Sawrikar and Hussain2023). Both models face challenges such as small sample sizes and inadequate validation, of which mental health care providers and researchers should be aware, underscoring the need for high-quality data and more explainable models in mental health research (Chung & Teo, Reference Chung and Teo2022).
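The sensitivity of the support vector machine to kernel choice can be demonstrated on data that are not linearly separable; the following sketch (synthetic two-moons data, illustrative only) contrasts a linear kernel with a radial basis function (RBF) kernel.

```python
# A sketch of the kernel sensitivity noted above: on data that is not linearly
# separable, a linear kernel typically underperforms, while an RBF kernel can
# recover the nonlinear boundary (synthetic, purely illustrative data).
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for kernel in ("linear", "rbf"):
    acc = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(f"{kernel} kernel: {acc:.3f}")
```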
Four limitations were identified for the use of AI to diagnose mental health conditions. First, the design of some AI-based diagnostic tools may be biased. Tate et al. (Reference Tate, McCabe, Larsson, Lundström, Lichtenstein and Kuja-Halkola2020) described that the AI model used in their study may have two types of bias: reporting bias, as the outcome and the most important variables were all parent-reported, and bias in variable importance owing to the use of mixed data types. Maekawa et al. (Reference Maekawa, Grua, Nakamura, Scazufca, Araya, Peters and Van De Ven2024), Kourou et al. (Reference Kourou, Manikis, Mylona, Poikonen-Saksela, Mazzocco, Pat-Horenczyk and Fotiadis2023), and Jaroszewski et al. (Reference Jaroszewski, Morris and Nock2019) argued that self-reported variables exhibit excessive subjectivity, and Kourou et al. (Reference Kourou, Manikis, Mylona, Poikonen-Saksela, Mazzocco, Pat-Horenczyk and Fotiadis2023) further suggested that although they included a substantial number of variables in their study, the coverage remained insufficient; all of these factors may lead to bias. Matsuo et al. (Reference Matsuo, Ushida, Emoto, Moriyama, Iitani, Nakamura and Kotani2022) and Chen et al. (Reference Chen, Chan, Li, Chan, Liu, Li and Wing2024) reported that the incorporation of insufficient variables and an imbalanced dataset in developing the AI model may lead to bias. Jacobson et al. (Reference Jacobson, Yom-Tov, Lekkas, Heinz, Liu and Barr2022) mentioned that there may be interrater bias regarding the features provided by their online mental health screening tools. Byun et al. (Reference Byun, Kim, Jang, Kim, Choi, Yu and Jeon2019) stated that the classification algorithms were less accurate for high-dimensional data. Setoyama et al. (Reference Setoyama, Kato, Hashimoto, Kunugi, Hattori, Hayakawa and Kanba2016) mentioned that confounding variables influenced the results of the prediction model. Second, due to confounding variables and specific populations, AI models might reveal correlations between mental health conditions and other variables, yet they are unable to establish causality (Maekawa et al., Reference Maekawa, Grua, Nakamura, Scazufca, Araya, Peters and Van De Ven2024; Simon et al., Reference Simon, Shortreed, Johnson, Rossom, Lynch, Ziebell and Penfold2019; Xu et al., Reference Xu, Thompson, Kerr, Godbole, Sears, Patterson and Natarajan2018). Third, the application of AI-assisted diagnosis involved trade-offs between different performance metrics, for instance, between model specificity and sensitivity (Adler et al., Reference Adler, Wang, Mohr and Choudhury2022; Andersson et al., Reference Andersson, Bathula, Iliadis, Walter and Skalkidou2021; Chilla et al., Reference Chilla, Yeow, Chew, Sim and Prakash2022; Cook et al., Reference Cook, Progovac, Chen, Mullin, Hou and Baca-Garcia2016). Fourth, in addition to the constraints mentioned above – the prevalent use of single datasets, small sample sizes, and technical issues – AI-assisted diagnostic models exhibited limited generalizability (Chen et al., Reference Chen, Chan, Li, Chan, Liu, Li and Wing2024; Geng et al., Reference Geng, An, Fu, Wang and An2023; Kourou et al., Reference Kourou, Manikis, Mylona, Poikonen-Saksela, Mazzocco, Pat-Horenczyk and Fotiadis2023; Lønfeldt et al., Reference Lønfeldt, Olesen, Das, Mora-Jensen, Pagsberg and Clemmensen2023; Maekawa et al., Reference Maekawa, Grua, Nakamura, Scazufca, Araya, Peters and Van De Ven2024). Liang et al. (Reference Liang, Vega, Kong, Deng, Wang, Ma and Li2018) and Kourou et al. (Reference Kourou, Manikis, Mylona, Poikonen-Saksela, Mazzocco, Pat-Horenczyk and Fotiadis2023) acknowledged that their models needed to be validated in other contexts. The above limitations should be considered to enable the optimal development and higher accuracy of AI-assisted diagnostic tools.
The application of AI to diagnose mental health conditions has brought several challenges. One concerns the difficulty of organizing or generalizing mental health conditions, the major variables used to develop AI models, and standardized measures for AI-assisted diagnosis. Maglanoc et al. (Reference Maglanoc, Kaufmann, Jonassen, Hilland, Beck, Landrø and Westlye2020) noted that mental disorders are clinically highly heterogeneous and thus may lack the convergence needed to stratify patients for AI model development. Schnack et al. (Reference Schnack, Nieuwenhuis, Van Haren, Abramovic, Scheewe, Brouwer and Kahn2014) suggested that interpreting the effects of specific brain regions with AI is complicated because the discriminative brain pattern describes the cumulative contributions of all features. According to Adler et al. (Reference Adler, Wang, Mohr and Choudhury2022), developing standardized measures of in-the-moment symptoms for continuous remote symptom assessment studies was challenging, as it was difficult to align outcome symptom measures across studies for model development. Challenges were also identified at the model-specific level. For instance, designing a convolutional neural network (CNN) requires careful adjustment to accommodate the input size and training objectives, including the network depth, the number of feature maps, and the kernels for each layer (Du et al., Reference Du, Liu, Balamurugan and Selvaraj2021), as illustrated in the sketch below. Cross-cultural variations and real-world resource constraints pose further challenges for implementing clinical recommendations derived from AI models.
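As a purely illustrative sketch of these design decisions, the PyTorch snippet below exposes the depth, feature-map counts, and kernel sizes as explicit choices; the input shape and layer sizes are assumptions for demonstration, not the configuration used by Du et al.

```python
# Minimal sketch (PyTorch): the CNN design choices named above - depth,
# number of feature maps, kernel size - are explicit hyperparameters
# that must be matched to the input size and training objective.
import torch
import torch.nn as nn

class DiagnosticCNN(nn.Module):
    def __init__(self, in_channels: int = 1, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            # 16 and 32 feature maps with 3x3 kernels: tunable design
            # choices, not fixed by the method itself
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # makes the head input-size agnostic
            nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

# e.g., a batch of eight single-channel 64x64 scans
logits = DiagnosticCNN()(torch.randn(8, 1, 64, 64))
print(logits.shape)  # torch.Size([8, 2])
```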
Six studies discussed ethical considerations surrounding the application of AI in diagnosing mental health issues (Adler et al., Reference Adler, Wang, Mohr and Choudhury2022; Jacobson et al., Reference Jacobson, Yom-Tov, Lekkas, Heinz, Liu and Barr2022; Jaroszewski et al., Reference Jaroszewski, Morris and Nock2019; Lønfeldt et al., Reference Lønfeldt, Olesen, Das, Mora-Jensen, Pagsberg and Clemmensen2023; Maekawa et al., Reference Maekawa, Grua, Nakamura, Scazufca, Araya, Peters and Van De Ven2024; Tsui et al., Reference Tsui, Shi, Ruiz, Ryan, Biernesser, Iyengar and Brent2021). The primary concern regarding AI models is safeguarding privacy, and all included papers agreed on the necessity of obtaining informed consent from data sources or patients. Adler et al. (Reference Adler, Wang, Mohr and Choudhury2022), Jacobson et al. (Reference Jacobson, Yom-Tov, Lekkas, Heinz, Liu and Barr2022), and Jaroszewski et al. (Reference Jaroszewski, Morris and Nock2019) also concurred that personally identifiable information should not be recorded or extracted. In addition to using privacy and data protection technologies, it is essential to offer appropriate knowledge support to subjects to address their concerns (Lønfeldt et al., Reference Lønfeldt, Olesen, Das, Mora-Jensen, Pagsberg and Clemmensen2023). Another ethical consideration involves providing assistance to high-risk participants. Tsui et al. (Reference Tsui, Shi, Ruiz, Ryan, Biernesser, Iyengar and Brent2021) suggested that clinicians and patients should be informed about risk data and potential treatment options. However, when using AI for diagnostic purposes, respecting patients’ right to provide informed consent is crucial to prevent data misuse and misinterpretation. It is also important to avoid overestimating the efficacy of AI models, as this could introduce biases and risks. The challenge of balancing privacy protection while aiding high-risk individuals (e.g., those with suicidal ideation) remains unresolved. Researchers must proceed with caution, ensuring the legal and ethical utilization of data, even when readily available (Maekawa et al., Reference Maekawa, Grua, Nakamura, Scazufca, Araya, Peters and Van De Ven2024).
In the monitoring domain, most studies have explored predictive models of treatment response in different psychiatric disorders, particularly depression. A variety of algorithms and models have shown promising performance. Chekroud et al. (Reference Chekroud, Zotti, Shehzad, Gueorguieva, Johnson, Trivedi and Corlett2016) demonstrated that machine learning achieved moderate performance in predicting treatment outcomes in different treatment groups. Lenhard et al. (Reference Lenhard, Sauer, Andersson, Månsson, Mataix-Cols, Rück and Serlachius2018) reported that machine learning models had good to excellent accuracy in predicting treatment outcomes in internet-delivered CBT for pediatric obsessive-compulsive disorder. Maciukiewicz et al. (Reference Maciukiewicz, Marshe, Hauschild, Foster, Rotzinger, Kennedy and Geraci2018) indicated that machine learning models had moderate performance in predicting treatment response but were less successful in predicting remission. Bailey et al. (Reference Bailey, Hoy, Rogasch, Thomson, McQueen, Elliot and Fitzgerald2018) investigated baseline and week-one measures of theta power and connectivity, which showed potential for predicting response to repetitive transcranial magnetic stimulation (rTMS) treatment. Kautzky et al. (Reference Kautzky, Dold, Bartova, Spies, Vanicek, Souery and Kasper2018) examined machine learning techniques and found promising results in predicting treatment-resistant depression. Foster et al. (Reference Foster, Mohler-Kuo, Tay, Hothorn and Seibold2019) showed that treatment combining CBT and fluoxetine consistently outperformed either therapy alone. Furukawa et al. (Reference Furukawa, Debray, Akechi, Yamada, Kato, Seo and Efthimiou2020) suggested the use of support vector machines to predict treatment outcomes in different treatment arms. Athreya et al. (Reference Athreya, Brückl, Binder, John Rush, Biernacka, Frye and Bobo2021) identified four depressive symptoms and specific thresholds of change that predicted treatment outcomes with an average accuracy of 77%. Bao et al. (Reference Bao, Zhao, Li, Zhang, Wu, Ning and Yang2021) employed machine learning models using genotyping information to predict treatment outcomes of ketamine infusions. Nguyen et al. (Reference Nguyen, Chin Fatt, Treacher, Mellema, Cooper, Jha and Montillo2022) demonstrated that predictive models could offer a possible precision medicine approach for antidepressant selection. Dong et al. (Reference Dong, Rokicki, Dwyer, Papiol, Streit, Rietschel and Koutsouleris2024) proposed that a sequential modeling approach improves the prediction of response to rTMS treatment in patients with schizophrenia while reducing diagnostic complexity. Wang, Wu, et al. (Reference Wang, Wu, DeLorenzo and Yang2024) demonstrated that their machine learning pipeline achieved high accuracy and an area under the receiver operating characteristic curve (AUC) above 0.80 on the training set when predicting treatment responses for patients with major depressive disorder from neuroimaging data, although extensive external validation is still required.
These studies examined responses to a variety of treatments, spanning medication, psychotherapy, and care. The predictive factors for these responses range from basic sociodemographic characteristics and treatment-related variables to genomics, acoustics, and other biomarkers. Amminger et al. (Reference Amminger, Mechelli, Rice, Kim, Klier, McNamara and Schäfer2015) conducted univariate and multivariate analyses, finding that fatty acids and symptoms could predict functional improvement in both the Omega-3 polyunsaturated fatty acids (ω-3 PUFA) and placebo groups. Guilloux et al. (Reference Guilloux, Bassi, Ding, Walsh, Turecki, Tseng and Sibille2015) found that gene expression profiles obtained from blood samples could predict remission and nonremission outcomes in response to citalopram treatment for depression. Iniesta et al. (Reference Iniesta, Malki, Maier, Rietschel, Mors, Hauser and Uher2016) discovered that demographic and clinical variables could predict therapeutic response to escitalopram with clinically significant accuracy. Nie et al. (Reference Nie, Vairavan, Narayan, Ye and Li2018) suggested that machine learning models using clinical and sociodemographic data could predict treatment-resistant depression. Browning et al. (Reference Browning, Kingslake, Dourish, Goodwin, Harmer and Dawson2019) found that cognitive and symptomatic measures were useful in guiding antidepressant treatment. Rajpurkar et al. (Reference Rajpurkar, Yang, Dass, Vale, Keller, Irvin and Williams2020) identified certain symptoms that exhibited high discriminative performance in predicting treatment outcomes, with baseline symptom severity being a critical predictor. Busk et al. (Reference Busk, Faurholt-Jepsen, Frost, Bardram, Vedel Kessing and Winther2020) found that historical mood was the most important predictor of future mood and that different mood scores were correlated. Jacobson et al. (Reference Jacobson, Yom-Tov, Lekkas, Heinz, Liu and Barr2022) found that online screening for depression influenced help-seeking behavior, suicidal ideation, and suicidal intent, and identified individuals who might benefit from treatment interventions. Dougherty et al. (Reference Dougherty, Clarke, Atli, Kuc, Schlosser, Dunlop and Ryslik2023) suggested that the response of patients with treatment-resistant depression to psilocybin can be accurately predicted using a logistic regression model that incorporates Emotional Breakthrough Index metrics, natural language processing metrics, and treatment arm data. Harrer et al. (Reference Harrer, Ebert, Kuper, Paganini, Schlicker, Terhorst and Baumeister2023) found that a multivariate tree learning model predicted that patients with lower back pain and moderate depression, coupled with relatively low pain self-efficacy, benefit the most from an internet-based depression intervention. Jankowsky et al. (Reference Jankowsky, Krakau, Schroeders, Zwerenz and Beutel2024) highlighted that treatment-related variables play a pivotal role in predicting treatment response in naturalistic inpatient samples with anxious and depressive symptoms. Scodari et al. (Reference Scodari, Chacko, Matsumura and Jacobson2023) discovered that patients with depressive symptoms who underwent stepped care were more likely to reduce PHQ-9 scores if they had high PHQ-9 but low HADS-Anxiety scores at baseline, a low number of chronic illnesses, and an internal locus of control. Wang, Wu, et al.
(Reference Wang, Wu, DeLorenzo and Yang2024) suggested that speech features, particularly energy parameters, serve as precise and objective indicators for tracking biofeedback therapy response and predicting efficacy for college students with symptoms of anxiety or depression. Hammelrath et al. (Reference Hammelrath, Hilbert, Heinrich, Zagorscak and Knaevelsrud2024) emphasized that therapeutic alliance and early symptom change are crucial predictors for anticipating non-response to a 6-week online depression program. Zainal and Newman (Reference Zainal and Newman2024) identified predictors, such as higher anxiety severity, elevated trait perseverative cognition, lower set-shifting deficits, older age, and stronger trait mindfulness, for individuals with generalized anxiety disorder who benefit from mindfulness ecological momentary intervention.
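Several of the studies above report discrimination as AUC, and some (e.g., Wang, Wu, et al.) caution that training-set figures alone are insufficient. A brief, illustrative sketch makes the point: AUC is computed separately on the data the model was fit to and on held-out data, and a large gap signals overfitting. The synthetic features below merely stand in for the clinical or neuroimaging predictors used in the cited studies.

```python
# Minimal sketch: training-set AUC vs. held-out AUC. Synthetic features
# stand in for the clinical or neuroimaging predictors discussed above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=30, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"training AUC = {train_auc:.2f}, held-out AUC = {test_auc:.2f}")
# A large gap indicates overfitting; external validation on an
# independent cohort is a stricter test still.
```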
Using AI to predict treatment response or the prognosis of mental health disorders has limitations related to data availability, model performance, and external validity. First, data quality, cost, and sample size can affect the performance of AI models (Athreya et al., Reference Athreya, Brückl, Binder, John Rush, Biernacka, Frye and Bobo2021; Bao et al., Reference Bao, Zhao, Li, Zhang, Wu, Ning and Yang2021; Brandt et al., Reference Brandt, Ritter, Schneider-Thoma, Siafis, Montag, Ayrilmaz and Stuke2023; Dougherty et al., Reference Dougherty, Clarke, Atli, Kuc, Schlosser, Dunlop and Ryslik2023; Maciukiewicz et al., Reference Maciukiewicz, Marshe, Hauschild, Foster, Rotzinger, Kennedy and Geraci2018; Wang, Wu, et al., Reference Wang, Wu, DeLorenzo and Yang2024). Second, overfitting is a common problem, and the trade-off between data quality and model robustness can affect model performance (Bailey et al., Reference Bailey, Hoy, Rogasch, Thomson, McQueen, Elliot and Fitzgerald2018; Busk et al., Reference Busk, Faurholt-Jepsen, Frost, Bardram, Vedel Kessing and Winther2020; Scodari et al., Reference Scodari, Chacko, Matsumura and Jacobson2023; Susai et al., Reference Susai, Mongan, Healy, Cannon, Cagney, Wynne and Cotter2022). Third, AI models built on diverse populations and interventions, with widely varying predictor variables, are difficult to compare or replicate across datasets, which limits the assessment of specific predictive factors; they may also fail to generalize to new samples, treatment settings, and populations (Brandt et al., Reference Brandt, Ritter, Schneider-Thoma, Siafis, Montag, Ayrilmaz and Stuke2023; Chekroud et al., Reference Chekroud, Zotti, Shehzad, Gueorguieva, Johnson, Trivedi and Corlett2016; Choo et al., Reference Choo, Wall, Brodsky, Herzog, Mann, Stanley and Galfalvy2024; Dong et al., Reference Dong, Rokicki, Dwyer, Papiol, Streit, Rietschel and Koutsouleris2024; Dougherty et al., Reference Dougherty, Clarke, Atli, Kuc, Schlosser, Dunlop and Ryslik2023; Rajpurkar et al., Reference Rajpurkar, Yang, Dass, Vale, Keller, Irvin and Williams2020; Ricka et al., Reference Ricka, Pellegrin, Fompeyrine, Lahutte and Geoffroy2023). Researchers should take steps to mitigate these limitations, such as using standardized experimental protocols and platforms, collecting complete data over an extended period, and testing the generalizability of AI models in routine clinical settings (Busk et al., Reference Busk, Faurholt-Jepsen, Frost, Bardram, Vedel Kessing and Winther2020; Foster et al., Reference Foster, Mohler-Kuo, Tay, Hothorn and Seibold2019).
Several challenges were identified in developing and applying treatment outcome prediction models. Browning et al. (Reference Browning, Kingslake, Dourish, Goodwin, Harmer and Dawson2019) noted the difficulty of predicting remission when the condition was less common. Busk et al. (Reference Busk, Faurholt-Jepsen, Frost, Bardram, Vedel Kessing and Winther2020) identified the challenges of collecting complete histories over longer periods to improve prediction, and of developing a real-time forecasting system when the intervention itself can change the outcome and the future training data. Chekroud et al. (Reference Chekroud, Zotti, Shehzad, Gueorguieva, Johnson, Trivedi and Corlett2016) noted the difficulty of identifying which variables to include in the prediction model. Choo et al. (Reference Choo, Wall, Brodsky, Herzog, Mann, Stanley and Galfalvy2024) emphasized that AI models may lack transparency regarding how input features influence predictions, thereby complicating assessments of predictor importance and causal inference; this presents a dual challenge for bias analysis and ethical considerations. Additionally, Guilloux et al. (Reference Guilloux, Bassi, Ding, Walsh, Turecki, Tseng and Sibille2015) mentioned the challenge of directly applying predictive models to different test studies because of cross-laboratory variability in probe designs arising from different experimental protocols and array platforms. These interplatform differences underscore the complexity of real-world scenarios, necessitating larger sample sizes and multicenter experiments in future research; however, this approach also heightens the risk of data leakage (Hilbert et al., Reference Hilbert, Böhnlein, Meinke, Chavanne, Langhammer, Stumpe and Lueken2024). These challenges highlight the importance of continued research and ethical integrity in improving the accuracy and generalizability of outcome prediction models.
Ethical considerations relating to the use of AI for monitoring mental health and predicting treatment response or prognosis of mental health disorders share key points with AI for diagnosis, emphasizing the critical importance of safeguarding patient privacy, informed consent, and autonomy. Informed consent stands as a fundamental element in these domains. One ethical concern was related to the data collected from electronic devices such as smartphones. These data should be stored on a secure server to ensure confidentiality and protect the participants’ privacy (Busk et al., Reference Busk, Faurholt-Jepsen, Frost, Bardram, Vedel Kessing and Winther2020). Furthermore, the protocol for using AI in mental health should be approved by the ethics boards of all centers involved to ensure the safety and privacy of the participants (Iniesta et al., Reference Iniesta, Malki, Maier, Rietschel, Mors, Hauser and Uher2016). Using AI for monitoring can be highly beneficial for patients, especially those at high risk, such as individuals prone to suicide (Barrigon et al., Reference Barrigon, Romero-Medrano, Moreno-Muñoz, Porras-Segovia, Lopez-Castroman, Courtet and Baca-Garcia2023; Choo et al., Reference Choo, Wall, Brodsky, Herzog, Mann, Stanley and Galfalvy2024). However, implementing this while ensuring patient privacy is maintained is a crucial element that future ethical considerations must address. Simultaneously, researchers must be mindful of the opacity of AI and the potential for bias, exercising caution against overly exaggerating the capabilities of AI (Choo et al., Reference Choo, Wall, Brodsky, Herzog, Mann, Stanley and Galfalvy2024).
AI chatbots were used to investigate the effectiveness of AI-assisted treatment (Danieli et al., Reference Danieli, Ciulli, Mousavi, Silvestri, Barbato, Di Natale and Riccardi2022; Dimeff et al., Reference Dimeff, Jobes, Koerner, Kako, Jerome, Kelley-Brimer and Schak2021; Fulmer et al., Reference Fulmer, Joerin, Gentile, Lakerink and Rauws2018; Karkosz et al., Reference Karkosz, Szymański, Sanna and Michałowski2024; Kleinau et al., Reference Kleinau, Lamba, Jaskiewicz, Gorentz, Hungerbuehler, Rahimi and Kapps2024; Klos et al., Reference Klos, Escoredo, Joerin, Lemos, Rauws and Bunge2021; Ogawa et al., Reference Ogawa, Oyama, Morito, Kobayashi, Yamada, Shinkawa and Hattori2022; Sabour et al., Reference Sabour, Zhang, Xiao, Zhang, Zheng, Wen and Huang2023; Schillings et al., Reference Schillings, Meißner, Erb, Bendig, Schultchen and Pollatos2023; Suharwardy et al., Reference Suharwardy, Ramachandran, Leonard, Gunaseelan, Lyell, Darcy and Judy2023). AI chatbots showed inconsistent performance in treating mental health conditions. Dimeff et al. (Reference Dimeff, Jobes, Koerner, Kako, Jerome, Kelley-Brimer and Schak2021) and Fulmer et al. (Reference Fulmer, Joerin, Gentile, Lakerink and Rauws2018) found that AI chatbots contributed to significant improvements in reducing suicidal, depressive, or anxiety symptoms. Sabour et al. (Reference Sabour, Zhang, Xiao, Zhang, Zheng, Wen and Huang2023) observed significantly greater improvement in symptoms of depression and negative affect with the chatbot. Karkosz et al. (Reference Karkosz, Szymański, Sanna and Michałowski2024) found that the chatbot reduced anxiety and depressive symptoms and decreased perceived loneliness among high-frequency users. Kleinau et al. (Reference Kleinau, Lamba, Jaskiewicz, Gorentz, Hungerbuehler, Rahimi and Kapps2024) reported a significant positive effect of Vitalk in reducing anxiety and depression.
Ogawa et al. (Reference Ogawa, Oyama, Morito, Kobayashi, Yamada, Shinkawa and Hattori2022) showed that their AI chatbot produced only small improvements in depression. Klos et al. (Reference Klos, Escoredo, Joerin, Lemos, Rauws and Bunge2021) indicated that their AI chatbot was effective only for anxiety and not for depressive symptoms. Danieli et al. (Reference Danieli, Ciulli, Mousavi, Silvestri, Barbato, Di Natale and Riccardi2022) concluded that a combination of in-person and AI treatment was more effective in reducing stress and anxiety than an AI chatbot alone. Schillings et al. (Reference Schillings, Meißner, Erb, Bendig, Schultchen and Pollatos2023) did not find the chatbot to be more effective than usual care in reducing stress and enhancing subjective well-being. Suharwardy et al. (Reference Suharwardy, Ramachandran, Leonard, Gunaseelan, Lyell, Darcy and Judy2023) concluded that the potential of the chatbot to reduce depressive symptoms in the general obstetric population was limited. AI chatbots generally use natural language processing techniques to understand and reply to questions from humans (Lalwani et al., Reference Lalwani, Bhalotia, Pal, Bisen and Rathod2018); a minimal sketch of this classic pattern follows below. In recent years, there has been a rise of generative AI chatbots, such as ChatGPT by OpenAI, which use transformer neural networks and large-scale language models (Atallah et al., Reference Atallah, Banda, Banda and Roeck2023; Jo et al., Reference Jo, Epstein, Jung and Kim2023). These generative AI chatbots are now being tested and used in various application domains, such as the service industry, the creative industry, banking and finance, and even healthcare. However, further investigation is required before they can be used as interventions for mental health disorders. Apart from chatbots, AI has shown significant potential in various applications within the medical field. Three additional studies employed AI for intervention assistance: Chen et al. (Reference Chen, Hsu, Lin, Chen, Hsieh and Ko2023) used AI for medication reminders and identification, resulting in significantly improved medication adherence. Sadeh-Sharvit et al. (Reference Sadeh-Sharvit, Camp, Horton, Hefner, Berry, Grossman and Hollon2023) used AI to aid therapists in mental health services, leading to increased patient session attendance. Yamada et al. (Reference Yamada, Akahane, Takeuchi, Miyata, Sato and Gotoh2024) introduced an AI puppy into a protective isolation unit, which notably reduced patients’ CgA and cortisol levels while increasing oxytocin production.
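For readers unfamiliar with the classic (pre-generative) chatbot pattern referenced above, the sketch below pairs a simple intent classifier with templated replies. The utterances, intent labels, and responses are invented placeholders, not content from any chatbot evaluated in the reviewed studies.

```python
# Minimal sketch of a rule/retrieval-style chatbot: classify the user's
# intent with a text classifier, then return a templated reply.
# All utterances, intents, and replies are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

training_utterances = [
    ("I feel anxious all the time", "anxiety"),
    ("I cannot stop worrying", "anxiety"),
    ("I feel hopeless and sad", "low_mood"),
    ("Nothing brings me joy anymore", "low_mood"),
]
texts, intents = zip(*training_utterances)

intent_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
intent_clf.fit(texts, intents)

REPLIES = {
    "anxiety": "It sounds like worry has been weighing on you. "
               "Would you like to try a brief grounding exercise?",
    "low_mood": "I'm sorry things feel low. "
                "Could we pick one small activity for today?",
}

user_message = "I'm so worried I can't sleep"
print(REPLIES[intent_clf.predict([user_message])[0]])
```

Generative chatbots replace both the classifier and the templates with a large language model, which is part of what makes their safety and hallucination profile harder to audit.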
In addition, machine learning proved effective in recommending both the treatment modality and the session frequency for depression. Bruijniks et al. (Reference Bruijniks, Van Bronswijk, DeRubeis, Delgadillo, Cuijpers and Huibers2022) showed that stratified care with a machine learning model was efficacious for treatment selection. Delgadillo et al. (Reference Delgadillo, Ali, Fleck, Agnew, Southgate, Parkhouse and Barkham2022) reported that machine learning enhanced recommendations for a minority of participants. Furukawa et al. (Reference Furukawa, Debray, Akechi, Yamada, Kato, Seo and Efthimiou2020) indicated that machine learning was able to predict the optimal frequency of CBT sessions. The basic approach in these studies is to collect patient data, identify predictive features, and then build a machine learning model that predicts the recommended treatment modality and frequency. These studies often favor simple yet interpretable regression and classification models, including linear regression, logistic regression, decision trees, and support vector machines, over complex neural network models (Bruijniks et al., Reference Bruijniks, Van Bronswijk, DeRubeis, Delgadillo, Cuijpers and Huibers2022; Furukawa et al., Reference Furukawa, Debray, Akechi, Yamada, Kato, Seo and Efthimiou2020); a brief sketch of this approach follows below.
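As a sketch of this interpretable-model approach, the snippet below fits a logistic regression to simulated baseline data and reads treatment-response effects directly off the standardized coefficients. The feature names and the simulated outcome relationship are assumptions for illustration, not results from the cited trials.

```python
# Minimal sketch: interpretable logistic regression for predicting
# response to a candidate treatment. Features and the simulated
# outcome relationship are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "baseline_phq9": rng.integers(5, 25, n),   # depressive severity
    "age": rng.integers(18, 70, n),
    "prior_episodes": rng.integers(0, 6, n),
})
# Simulated outcome: responders tend to have lower baseline severity.
latent = 1.5 - 0.12 * df["baseline_phq9"] + rng.normal(0, 1, n)
responded = (latent > 0).astype(int)

X = StandardScaler().fit_transform(df)
model = LogisticRegression().fit(X, responded)

# Coefficients on standardized features are directly comparable -
# the interpretability advantage over neural network models.
for name, coef in zip(df.columns, model.coef_[0]):
    print(f"{name:>15}: {coef:+.2f}")
```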
There are some limitations to the use of AI in mental health interventions, as noted by the authors of several included studies. Fulmer et al. (Reference Fulmer, Joerin, Gentile, Lakerink and Rauws2018) summarized four limitations of AI chatbots: (1) emotional identification was limited to language, as the chatbots did not consider facial expressions, body cues, or tone of voice; (2) the interaction process was sometimes unnatural; (3) AI chatbots sometimes misunderstood replies from users; and (4) the content provided by AI chatbots was irrelevant or insufficiently interactive. Klos et al. (Reference Klos, Escoredo, Joerin, Lemos, Rauws and Bunge2021) proposed two more limitations: (5) digital interventions may have limited capacity to capture the motivation and attention of users, and (6) easy access may result in lower commitment to using AI. Finally, the requirement for devices equipped with AI may skew the participant pool toward urban areas and higher education levels, introducing bias and potentially reducing the generalizability of AI deployment (Chen et al., Reference Chen, Hsu, Lin, Chen, Hsieh and Ko2023; Kleinau et al., Reference Kleinau, Lamba, Jaskiewicz, Gorentz, Hungerbuehler, Rahimi and Kapps2024). These limitations suggest directions for future improvement in the design and functions of AI-assisted interventions.
Several studies noted that implementing AI-assisted interventions was sometimes challenging. Dimeff et al. (Reference Dimeff, Jobes, Koerner, Kako, Jerome, Kelley-Brimer and Schak2021) reported that successful implementation depended on the willingness of the staff involved to incorporate AI into their workflow. Klos et al. (Reference Klos, Escoredo, Joerin, Lemos, Rauws and Bunge2021) mentioned the difficulty of accurate translation when applying an established database and algorithm of an AI chatbot to another language. Schillings et al. (Reference Schillings, Meißner, Erb, Bendig, Schultchen and Pollatos2023) proposed that risks such as safety, data privacy, biases, limited empathy, and potential hallucinations relative to human interactions require in-depth discussion. These challenges may be eased through the wider adoption of AI, supported by evidence-based research, expanded databases, technological advancements, and more robust regulation.
In addition to the ethical considerations aligned with the diagnosis and monitoring domains, certain issues discussed in the studies on AI-assisted interventions were particularly important for the treatment of suicidal individuals in the emergency department. Dimeff et al. (Reference Dimeff, Jobes, Koerner, Kako, Jerome, Kelley-Brimer and Schak2021) reported several ethical practices: (1) an advisory group of people with lived experience of suicide should be involved in developing the use of the AI model, (2) interventions should draw upon well-established, evidence-based practice for suicide prevention, (3) a timed protocol simulation test should be conducted, (4) all procedures should be approved by a review board, and (5) external monitoring should be provided by an independent board of recognized suicide experts. Both Fulmer et al. (Reference Fulmer, Joerin, Gentile, Lakerink and Rauws2018) and Klos et al. (Reference Klos, Escoredo, Joerin, Lemos, Rauws and Bunge2021) agreed that (6) crisis support should be provided if users express suicidal ideation, and that (7) users should be encouraged to end the chat and reach out for professional help.
This systematic review highlighted the potential of AI in the diagnosis, monitoring, and intervention of mental health disorders. The studies reviewed demonstrated that machine learning algorithms can accurately detect and predict mental health conditions using various predictors, including demographic information, socioeconomic data, clinical history, psychometric data, medical scans, biomarkers, and semantic content. The review also indicated that AI can effectively monitor treatment response and predict the ongoing prognosis of mental health disorders. The studies reviewed in the intervention domain showed that AI-assisted interventions, in the form of chatbots, had the potential to be an effective alternative to traditional in-person interventions and psychoeducation eBooks. The use of AI for intervention assistance in the medical field holds immense promise and warrants further in-depth exploration and research.
The findings of this review can inform AI developers and healthcare practitioners about the development and the choice of AI-based tools and interventions, which can improve the accuracy of mental health diagnosis, treatment, and outcomes. Future directions should focus on developing more robust and diverse datasets and improving the interpretability and transparency of AI models to facilitate their integration into clinical practice.
Two important applications of AI that fell outside the inclusion criteria were identified during the study selection process of this systematic review. First, studies utilizing AI to predict improvements in mental health or symptom remission prior to treatment initiation may still be of significant value for future research; if the accuracy and reliability of these predictions are high, they could serve as useful tools to assist treatment decision-making. Second, machine learning was adopted to predict treatment outcomes to facilitate the choice of treatment modality (Delgadillo et al., Reference Delgadillo, Ali, Fleck, Agnew, Southgate, Parkhouse and Barkham2022; Kleinerman et al., Reference Kleinerman, Rosenfeld, Benrimoh, Fratila, Armstrong, Mehltretter and Kapelner2021) or frequency (Bruijniks et al., Reference Bruijniks, Van Bronswijk, DeRubeis, Delgadillo, Cuijpers and Huibers2022). For instance, Kleinerman et al. (Reference Kleinerman, Rosenfeld, Benrimoh, Fratila, Armstrong, Mehltretter and Kapelner2021) found that AI was effective in predicting treatment outcomes prior to treatment initiation and in promoting personalized decision-making: up to 23% of participants with depressive symptoms achieved remission earlier, without multiple treatment attempts, than those under random treatment allocation. This impactful study supported the use of AI in treatment recommendations for better treatment allocation and more efficient treatment. Overall, AI was found to have broader applications than the focus of our systematic review as defined by the inclusion criteria.

General limits of AI

First, common issues observed among the included studies were insufficient sample sizes and a lack of diversity in datasets. These limitations lead to imbalanced results and fixed features that compromise model performance. Insufficient diversity can introduce bias given the specific (i.e., limited representation or homogeneous) populations from which the data are drawn, while missing data often result in incompleteness, inconsistency, or inaccuracy (Noorbakhsh-Sabet et al., Reference Noorbakhsh-Sabet, Zand, Zhang and Abedi2019). Such challenges are compounded by noisy and high-dimensional data, making accurate predictions difficult (Noorbakhsh-Sabet et al., Reference Noorbakhsh-Sabet, Zand, Zhang and Abedi2019). Predictive models also suffer from low input data quality that inadequately represents diverse populations, which hinders their effectiveness (Tejavibulya et al., Reference Tejavibulya, Rolison, Gao, Liang, Peterson, Dadashkarimi and Scheinost2022). Additionally, deep learning models, although capable of reducing dimensionality, are prone to overfitting when training samples are limited, further constraining their predictive capabilities (Noorbakhsh-Sabet et al., Reference Noorbakhsh-Sabet, Zand, Zhang and Abedi2019). Recognizing and addressing these issues is crucial for optimizing the clinical utility of AI in mental health. Second, the inclusion of singular, excessive, or incomplete variables, as well as the presence of confounding variables, may introduce bias into the analysis. Outcome and predictor variables often share common methods, necessitating a strategy to minimize redundancy (Chahar et al., Reference Chahar, Dubey and Narang2021). AI models require transparency and articulation to manage complex interactions (Jha et al., Reference Jha, Awasthi, Kumar, Kumar and Sethi2021).
Since mental health variables exhibit intricate dependencies with potential confounders, it is essential to use data-driven structural learning of Bayesian networks to extend association analyses (Jha et al., Reference Jha, Awasthi, Kumar, Kumar and Sethi2021). This approach can offer advantages over black-box machine learning and traditional statistical methods by enabling the discovery and modeling of confounding factors transparently (Jha et al., Reference Jha, Awasthi, Kumar, Kumar and Sethi2021). Standard statistical methods struggle to analyze interactions among numerous variables, whereas structured learning can effectively identify mediation, confounding, and intercausal effects (Jha et al., Reference Jha, Awasthi, Kumar, Kumar and Sethi2021). Confounding bias is a notable concern. Confounding arises when a variable influences both the exposure and the outcome, generating misleading associations (Prosperi et al., Reference Prosperi, Guo, Sperrin, Koopman, Min, He and Bian2020). Observational data, when adjusted for measured confounding – such as through propensity score matching – can help mimic randomized treatment assignment, particularly when using detailed electronic medical records (Prosperi et al., Reference Prosperi, Guo, Sperrin, Koopman, Min, He and Bian2020).
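A compact, illustrative sketch of the propensity score idea follows: the probability of receiving treatment is modeled from measured confounders, and each treated case is matched to the control with the closest score. All variables are simulated, and the three confounders are generic stand-ins rather than measures from any cited study.

```python
# Minimal sketch: propensity score matching on measured confounders.
# All data are simulated; the confounders are generic stand-ins
# (e.g., age, severity, comorbidity).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 1000
confounders = rng.normal(size=(n, 3))
# Treatment assignment depends on the confounders (the source of bias).
p_treat = 1 / (1 + np.exp(-(confounders @ np.array([0.8, -0.5, 0.3]))))
treated = rng.random(n) < p_treat

# Propensity score: modeled probability of treatment given confounders.
ps = LogisticRegression().fit(confounders, treated).predict_proba(confounders)[:, 1]

# Match each treated case to the control with the closest score.
controls = np.flatnonzero(~treated)
nn = NearestNeighbors(n_neighbors=1).fit(ps[controls].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
matched_controls = controls[idx.ravel()]

print("confounder means, treated:         ",
      confounders[treated].mean(axis=0).round(2))
print("confounder means, matched controls:",
      confounders[matched_controls].mean(axis=0).round(2))
# After matching, measured-confounder distributions should be closer,
# mimicking randomized assignment on observed variables only.
```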
Third, some studies lacked effective external validation, which could impact the reliability and generalizability of their findings. External validation in AI mental health research is still rare (Tornero-Costa et al., Reference Tornero-Costa, Martinez-Millana, Azzopardi-Muscat, Lazeri, Traver and Novillo-Ortiz2023). Designing appropriate trials for AI applications is challenging due to funding and resource constraints (Tornero-Costa et al., Reference Tornero-Costa, Martinez-Millana, Azzopardi-Muscat, Lazeri, Traver and Novillo-Ortiz2023). As a result, retrospective data are often used, raising concerns about their suitability for AI development (Tornero-Costa et al., Reference Tornero-Costa, Martinez-Millana, Azzopardi-Muscat, Lazeri, Traver and Novillo-Ortiz2023). Furthermore, some authors may overlook the need for a robust preprocessing pipeline (Tornero-Costa et al., Reference Tornero-Costa, Martinez-Millana, Azzopardi-Muscat, Lazeri, Traver and Novillo-Ortiz2023). Consequently, while acknowledging poor model performance, authors often suggest trial-based improvements instead of addressing statistical biases in model development, which could save time and costs (Tornero-Costa et al., Reference Tornero-Costa, Martinez-Millana, Azzopardi-Muscat, Lazeri, Traver and Novillo-Ortiz2023). Therefore, before deploying pretrained models, rigorous external validation with independent samples is necessary to ensure generalizability (He et al., Reference He, Sakuma, Kishi, Li, Matsunaga, Tanihara and Ota2024). A model should demonstrate excellent generalizability before being considered for commercial use (He et al., Reference He, Sakuma, Kishi, Li, Matsunaga, Tanihara and Ota2024). Fourth, balancing different performance metrics poses a challenge in evaluating the effectiveness of AI models consistently. Finally, issues such as the opacity of AI, potential bias or exaggerated predictions, cross-cultural differences, resource constraints, ethical considerations, and technical limitations make the seamless translation of AI findings into real-world applications challenging.
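The external-validation requirement raised in the third point above can be sketched as follows: a model is frozen after fitting on a development cohort and then scored, without refitting, on an independent cohort. Here a covariate shift is simulated to stand in for a new site or population; all data and the size of the shift are illustrative assumptions.

```python
# Minimal sketch: external validation of a frozen model on an
# independent cohort. The "external" cohort is simulated by adding a
# covariate shift to held-back samples from the same task.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
X, y = make_classification(n_samples=900, n_features=20, random_state=0)

X_dev, y_dev = X[:600], y[:600]                   # development cohort
X_ext = X[600:] + rng.normal(0, 0.5, (300, 20))   # shifted "new site"
y_ext = y[600:]

model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)

dev_auc = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])
ext_auc = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
print(f"development AUC = {dev_auc:.2f}, external AUC = {ext_auc:.2f}")
# No refitting on the external cohort: the question is how well the
# frozen model transfers to an independent sample.
```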
Real-world applications and future directions
While AI still faces numerous limitations in diagnosis, monitoring, and intervention, it holds vast potential in the healthcare sector, particularly in mental health. AI applied in mental health has more potential than in other healthcare modalities because it allows for a more objective redefinition of psychiatric illnesses, surpassing traditional diagnostic frameworks like the DSM-5 (Alhuwaydi, Reference Alhuwaydi2024). Additionally, through advanced techniques such as multimodal emotion recognition and machine learning, AI can facilitate early diagnosis and personalized intervention strategies that adapt to individual patients’ needs, addressing both the obstacles and opportunities in mental healthcare (Alhuwaydi, Reference Alhuwaydi2024). To effectively implement AI in clinical settings, researchers and practitioners should focus on developing larger, more diverse datasets and systematic bias detection and correction methods. It is crucial to ensure high data quality and balanced performance metrics to enhance model reliability. Continuous monitoring of AI innovations and maintaining transparency can help to overcome inherent technical constraints (Kiseleva et al., Reference Kiseleva, Kotzinos and De Hert2022). Additionally, the interactivity of chatbots and the adoption of AI technologies must be prioritized for effective interventions. Maintaining ethical integrity is of paramount importance. Regulatory bodies must guarantee patient privacy, require informed consent, and enhance data security to safeguard ethical standards in AI applications.
Researchers and practitioners should also address the common limits of AI, such as insufficient sample size, lack of diversity, and data quality issues, which can undermine predictive accuracy. Using data-driven structural learning approaches can help to manage complex relationships and minimize confounding biases that may generate misleading results. Prioritizing transparency and articulation in AI models is essential for building trust and ensuring clinical utility. Rigorous external validation is necessary before deploying any pre-trained AI models, as this confirms their generalizability across diverse populations.
Limitations of the review
This systematic review has some limitations. First, excluding conference papers may have limited the review’s scope, potentially overlooking important advancements in AI tools for mental health presented at conferences. Second, the lack of critical analysis of the AI models used in the reviewed studies hinders a comprehensive evaluation of their efficacy and reliability in mental health care settings. Third, the exclusion of studies published in languages other than English limits the generalizability of this synthesis, as it disregards potentially relevant research findings that may contribute unique insights, methodologies, or outcomes specific to the cultural contexts of diverse populations.
Conclusions
This systematic review underscores the significant potential of AI to transform the landscape of mental health diagnosis, monitoring, and intervention. With over half of the studies assessed rated as good in quality, AI methodologies have demonstrated commendable accuracy in detecting and predicting mental health conditions across diverse datasets. Notably, machine learning algorithms showed efficacy in classifying various mental disorders and predicting treatment responses, suggesting a promising pathway for personalized mental health care. However, the review also highlighted critical limitations, including methodological inconsistencies, issues with data quality and diversity, and ethical challenges related to privacy and informed consent. These factors necessitate careful consideration in the development and application of AI tools in clinical practice. The findings inform AI developers and mental health practitioners, advocating for further exploration of data-driven approaches, improved model transparency, and rigorous external validation. Future research should aim to bridge existing gaps and enhance the robustness of AI applications in mental health to ensure they meet the diverse needs of patients effectively and ethically.
Funding
This study was not funded.
Competing interest
The authors have no conflicts of interest to declare.