Introduction
Schizophrenia is a chronic and severe mental illness affecting 24 million people globally [Reference Solmi, Seitidis, Mavridis, Correll, Dragioti and Guimond1], accounting for approximately 1% of the population [Reference Janoutová, Janácková, Serý, Zeman, Ambroz and Kovalová2]. On average, people with schizophrenia have a life expectancy 15 years shorter than the general population, and a risk of death by suicide of up to 10% [Reference Hjorthøj, Stürup, McGrath and Nordentoft3]. Although symptoms and their severity vary widely between individuals, contributing to the complexities of treatment, one frequently experienced symptom is auditory verbal hallucinations (AVHs), i.e., “hearing voices,” present in approximately 70% of people with schizophrenia [Reference McCarthy-Jones4]. AVHs are a key contributor to schizophrenia-related morbidity and mortality by reducing quality of life and elevating the risk of violence and suicide [Reference Volavka, Laska, Baker, Meisner, Czobor and Krivelevich5, Reference Fujita, Takahashi, Nishida, Okumura, Ando and Kawano6]. Therefore, managing AVHs is critical to reduce the morbidity of this illness, with implications for potential uses in other conditions in which AVHs are a prominent symptom, such as in schizoaffective disorder or bipolar affective disorder [Reference Bassett, Boyce, Lyndon, Mulder, Parker and Porter7].
The current first-line treatment for schizophrenia is antipsychotic medication [Reference Maroney8], which affords the majority of patients with effective symptom control and overall improved quality of life [Reference Haddad and Correll9], for 30% of patients, these AVHs persist despite treatment [Reference Meltzer10–Reference Essock, Hargreaves, Dohm, Goethe, Carver and Hipshman12]. People with schizophrenia can often lack insight into their condition, impacting medication adherence [Reference Phelan and Sigala13], which is compounded by the potential for severe adverse effects of antipsychotics. Some serious side effects include tardive dyskinesia [Reference Rummel-Kluge, Komossa, Schwarz, Hunger, Schmid and Kissling14], prolonged QT syndrome [Reference Haddad and Anderson15], and agranulocytosis, which is of particular concern when using clozapine, an antipsychotic reserved for treatment-resistant schizophrenia [Reference Mijovic and MacCabe16].
Although antipsychotics are an essential part of treatment regimens for those with psychotic illness, the limited efficacy of antipsychotics in treating AVHs for some patients underscores a need for alternative or additional treatments, prompting a recent shift towards non-pharmacological interventions. These interventions are diverse and include talking therapies such as cognitive behavioural therapy (CBT) [Reference Sevi, Tekinsav Sutcu, Yesilyurt, Turan Eroglu and Gunes17], acceptance and commitment therapy (ACT) [Reference Shawyer, Farhall, Thomas, Hayes, Gallop and Copolov18], and AVATAR therapy [Reference Garety, Edwards, Jafari, Emsley, Huckvale and Rus-Calafell19], as well as neuromodulatory approaches like repetitive transcranial magnetic stimulation (rTMS) [Reference Hua, Wang, He, Sun, Xu and Zhang20] and transcranial direct current stimulation (tDCS) [Reference Fröhlich, Burrello, Mellin, Cordle, Lustenberger and Gilmore21]. Some studies have also investigated the role of music therapy and mindfulness-based therapies [Reference Pinar and Tel22, Reference Chadwick, Strauss, Jones, Kingdon, Ellett and Dannahy23]. While research increasingly focuses on such approaches, emerging pharmacological strategies, such as the combined use of two long-acting injectable antipsychotics in treatment-resistant cases, are also being explored to optimise outcomes in challenging cases [Reference Cipolla, Catapano, D’Amico, Monda, Sallusto and Perris24]. Non-pharmacological treatments are often trialled alongside antipsychotic medications. However, there is still uncertainty surrounding the efficacy of non-pharmacological treatments, in addition to a lack of specifically tailored guidance for their implementation in clinical settings. Reasons for this may include conflicting evidence across various studies, summarised in previous systematic reviews focusing on specific types of non-pharmacological treatments [Reference Dellazizzo, Giguère, Léveillé, Potvin and Dumais25–Reference Rashidi, Jones, Murillo-Rodriguez, Machado, Hao and Yadollahpour28]. Moreover, previous systematic reviews have focused on specific subtypes of non-pharmacological interventions for AVHs, lacking in critical evaluation and comparison between different types of non-pharmacological interventions [Reference Dellazizzo, Giguère, Léveillé, Potvin and Dumais25–Reference Rashidi, Jones, Murillo-Rodriguez, Machado, Hao and Yadollahpour28]. Additionally, such studies have not exclusively reviewed randomised controlled trials (RCTs), which are the most methodologically robust way to test an intervention [Reference Zabor, Kaizer and Hobbs29]. Therefore, this study seeks to provide an evaluation of recent RCTs testing non-pharmacological treatments for AVHs in SSDs. To the best of our knowledge, this is the first meta-analysis to be conducted on all non-pharmacological interventions for AVHs in SSDs.
Aims
The specific objectives of this systematic review and meta-analysis were:
-
1. To evaluate the overall efficacy of non-pharmacological interventions for AVHs based on RCTs conducted in the last decade.
-
2. To compare the effectiveness of different types of interventions, such as CBT, rTMS and AVATAR therapy, measured by validated rating scales.
-
3. To assess the methodological quality and limitations of included studies.
-
4. To identify gaps in research and provide recommendations for future studies and clinical guidelines.
Methods
Search strategy, selection criteria, and quality assessment
This systematic review and meta-analysis was performed in accordance with the Preferred reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [Reference Moher, Liberati, Tetzlaff and Altman30] (see Supplementary Figure S1). The protocol was preregistered on PROSPERO (ID: CRD42024598615) on 18 November 2024.
Five electronic databases were searched on 25 November 2024. Searches were performed using MeSH terms and Boolean operators, outlined in Table 1. Full search strings are provided in Supplementary Table S2. Additional search filters were applied, detailed in Table 2.
Table 1. Search details

Table 2. Search filters applied to databases, and number of studies yielded

Inclusion criteria for studies were a) Adults aged ≥18 years, b) the majority of participants with a formal diagnosis of schizophrenia or other psychotic disorder, c) RCTs, d) any treatment/therapy for auditory hallucinations, e) papers published in English, f) papers published from 2013 and onwards. This publication date cutoff was used to focus on the most recent and clinically relevant studies, avoid overlap with earlier reviews, and maintain a manageable scope given the growing body of literature. Papers including participants with substance misuse or a coexisting neurological condition were excluded. All other study types that were not RCTs were excluded.
The Standard Quality Assessment Criteria for Evaluating Primary Research Papers from a Variety of Fields, also known as the QualSyst Tool [Reference Kmet, Cook and Lee31] detailed in Table 3, was used to evaluate the risk of bias of each article that met the inclusion criteria. Risk of bias scores for each study were calculated by both reviewers independently, then compared to evaluate inter-operator agreement, and an inclusion threshold of 0.55 was used. Bias assessment was conducted on all 77 papers, which underwent full-text screening for eligibility. All papers met the 0.55 threshold; therefore, no studies were excluded based on methodological quality. The final selection of 45 studies included in the meta-analysis was based on the previously detailed inclusion criteria, including the availability of AVH-specific outcome data. If an article met all eligibility criteria but lacked data on the AVH measure, corresponding authors were contacted. Out of 14 corresponding authors contacted, 2 responded with the post-treatment data required [Reference Chadwick, Strauss, Jones, Kingdon, Ellett and Dannahy23, Reference Xie, Guan, He, Wang, Ma and Fang32]. The other 11 studies were not included in this meta-analysis [Reference Johnsen, Sinkeviciute, Loberg, Kroken, Hugdahl and Jorgensen33–Reference Gornerova, Brunovsky, Klirova, Novak, Zaytseva and Koprivova43]. Another paper was excluded [Reference Yuanjun, Guan, Zhang, Ma, Wang and Lin44] due to the author responding to explain that their paper reanalysed the data reported in study 31. QualSyst bias assessment scores for each study are provided in Supplementary Table S3.
Table 3. QualSyst risk of bias assessment process

Data synthesis and analysis
Data extraction started on 17 December 2024. The following information was extracted from each article: setting, patient sex, age, ethnicity, diagnosis, other treatments participants received (other than the intervention of interest), intervention tested, control condition, frequency and dosage of intervention, AVH scoring system, number of participants in each group, mean and standard deviations of the AVH score at baseline and post-treatment, and a summary of key findings.
The primary outcome measure was the difference in post-treatment AVH scores between the control group and intervention group, where the AVH score measured at least one of the following: frequency, distress, or intensity of AVH. The data required to calculate change scores and standard deviations were inconsistently reported; therefore, to avoid making assumptions, using post-treatment AVH scores was most suitable. This is appropriate in the absence of change score data, given that there are minimal baseline differences between the groups. Baseline scores were carefully considered for all studies, and were similar across groups, providing a suitable and comparable measure of treatment effect.
Analyses were performed using the software Comprehensive Meta-Analysis Version 4 (CMA V4), [Reference Borenstein, Hedges, Higgins and Rothstein45] which assessed the heterogeneity of the studies, and calculated the effect size for each study and overall. Forrest plots were produced, using a random effects model with 95% confidence intervals. The heterogeneity of the studies (the outcome variability due to clinical and methodological differences) was measured with Higgins’s I2 statistic [Reference Higgins, Thompson, Deeks and Altman46]. Higgins’s I2 represents the percentage of variation between the sample estimates that is due to heterogeneity rather than sampling error [Reference Higgins, Thompson, Deeks and Altman46]. Subgroup analyses were performed for an intervention reported in at least five studies.
Risk of publication bias was independently assessed by both reviewers through visual inspection of the funnel plot, which was created using CMA V4 [Reference Muka, Glisic, Milic, Verhoog, Bohlius and Bramer47].
Results
The initial search produced 717 records. After removal of duplicates, titles, and abstracts of 399 articles were screened. If titles and abstracts appeared relevant, full texts were retrieved. 77 full-text articles were assessed for eligibility. 45 articles met all eligibility criteria and were included in the meta-analysis (Figure 1).

Figure 1. PRISMA flow diagram for study inclusion.

Figure 2. Forest plot of standardized mean differences (SMDs) between post-intervention AVH scores in those who received an intervention for AVHs (A), versus the control group (B). Negative effect sizes indicate lower (improved) AVH scores in the intervention group compared to the control group.
Study characteristics and participant characteristics
Studies identified included the following 14 types of treatment: rTMS, CBT, transcranial direct current stimulation (tDCS), AVATAR therapy, transcranial alternating current stimulation (tACS), smartphone app treatments, acceptance and commitment therapy (ACT), psychological online intervention (POI), relating therapy (RT), counselling, music therapy (MT), referral to a community psychologist, group mindfulness-based intervention, and rehabilitation. Supplementary Figure S1 reports the number of each intervention type included in the analyses.
For all studies, sample sizes ranged from 12 to 201 participants, with a median of 37. Inclusion criteria across studies involved adults diagnosed with schizophrenia or related disorders who experienced AVHs. Various AVH scoring systems were used across studies, including the Auditory Hallucinations Rating Scale (AHRS) [Reference Haddock, McCarron, Tarrier and Faragher48], Positive and Negative Syndrome Scale – Auditory Hallucinations (PANSS AH) [Reference Mortimer49], Psychotic Syndrome Rating Scales – Auditory Hallucinations (PSYRATS AH) [Reference Haddock, McCarron, Tarrier and Faragher48], The Hamilton Program for Schizophrenia Voices Questionnaire (HPSVQ) [Reference Van Lieshout and Goldberg50], Delusion and Voices Self-Assessment (DV-SA) [Reference Pinto, Gigantesco, Morosini and La Pia51], and Characteristics of Auditory Hallucination Questionnaire (CAHQ) [Reference Buccheri, Trygstad, Dowling, Hopkins, White and Griffin52–Reference Buccheri, Trygstad, Buffum, Lyttle and Dowling54]. Each scoring system included a measurement of at least one of frequency, distress, or intensity of AVHs. For all AVH scoring systems used, a lower score indicated better AVH outcomes. A change in baseline to post-treatment scores was interpreted as reflecting treatment effects, with a decrease in scores indicating improvement.
Trials were conducted across 20 different countries. The majority of studies (31) were conducted in Western countries, with an emerging representation (12) from Asia. There was an underrepresentation from Africa and Latin America, with only two studies conducted in these regions.
In all but two trials [Reference Fitzgerald, McQueen, Daskalakis and Hoy55, Reference Percie du Sert, Potvin, Lipp, Dellazizzo, Laurelli and Breton56], it was stated that the intervention being explored was administered alongside usual treatments, with a common theme of maintaining stable medication regimes. This approach ensured that any observed changes could be attributed to the intervention, rather than changes in established routine treatments.
Participant demographics and study characteristics are detailed in Tables 4 and 5, respectively.
Table 4. Demographics of included studies [Reference Sevi, Tekinsav Sutcu, Yesilyurt, Turan Eroglu and Gunes17–Reference Chadwick, Strauss, Jones, Kingdon, Ellett and Dannahy23, Reference Xie, Guan, He, Wang, Ma and Fang32, Reference Fitzgerald, McQueen, Daskalakis and Hoy55–Reference Jongeneel, Delespaul, Tromp, Scheffers, van der Vleugel and de Bont91]

Abbreviations: c, control group, I, intervention group, TAU, treatment as usual, t, total combined value for both the intervention and control groups.
Table 5. Characteristics of included studies [Reference Sevi, Tekinsav Sutcu, Yesilyurt, Turan Eroglu and Gunes17–Reference Chadwick, Strauss, Jones, Kingdon, Ellett and Dannahy23, Reference Xie, Guan, He, Wang, Ma and Fang32, Reference Fitzgerald, McQueen, Daskalakis and Hoy55–Reference Jongeneel, Delespaul, Tromp, Scheffers, van der Vleugel and de Bont91]

Abbreviations: ACT, acceptance and commitment therapy; AHRS, Auditory Hallucinations Rating Scale; c, control group; CAHQ, Characteristics of Auditory Hallucination Questionnaire; CBT, cognitive behavioural therapy.; DV-SA, Delusion and Voices Self-Assessment; HPSVQ, The Hamilton Program for Schizophrenia Voices Questionnaire; i, intervention group; MT, music therapy; PANSS AH, Positive and Negative Syndrome Scale – Auditory Hallucinations; PSYRATS AH, Psychotic Syndrome Rating Scales – Auditory Hallucinations; RT, relating therapy; rTMS, repetitive transcranial magnetic stimulation; tACS, transcranial alternating current stimulation; tDCS, transcranial direct current stimulation.
Synthesis of results
Overall
Our sample size included 2,314 participants across all studies. The funnel plot (Figure 3) revealed an even distribution of studies reporting both positive and negative effects, and no bias for studies reporting larger, rather than smaller, effect sizes. An additional precision funnel plot was created to provide a clearer depiction of the relationship between study size and effect size (Supplementary Figure S2). Overall, a medium and statistically significant effect size of −0.298 (95% CI, [−0.470, −0.126], Z = −3.400, p = 0.001) was observed across all non-pharmacological interventions (Figure 2). There was significant heterogeneity between studies (Q = 171.566 with 44 degrees of freedom, p < 0.001, I 2 = 74%), indicating substantial variability in effect sizes.

Figure 3. Funnel plot for publication bias.
Subgroup analysis of interventions
As significant heterogeneity was determined, subgroup analyses of AVATAR therapy, CBT, rTMS, and tDCS were performed to explore whether different intervention types influenced effect sizes.
AVATAR therapy
Five studies tested the use of AVATAR therapy and had sample sizes ranging from 22 to 201 with a median of 74 [Reference Garety, Edwards, Jafari, Emsley, Huckvale and Rus-Calafell19, Reference Percie du Sert, Potvin, Lipp, Dellazizzo, Laurelli and Breton56, Reference Craig, Rus-Calafell, Ward, Leff, Huckvale and Howarth75, Reference Dellazizzo, Potvin, Phraxayavong and Dumais84, Reference Liang, Li, Guo, Liu, Liu and Zhao87]. All studies used the PSYRATS AH scoring system and were carried out across three different countries and continents: the UK, Canada, and China. AVATAR therapy showed a medium and statistically significant effect size of −0.425 (95% CI, [−0.601, −0.250], Z = −4.741, p < 0.001) (Figure 4). Effect sizes were highly consistent across studies, showing no significant heterogeneity (Q = 2.649, I 2 = 0%).

Figure 4. Forest plot of standardized mean differences (SMD) between post-intervention AVH scores in those who received AVATAR therapy for AVHs (A), versus the control group (B). Negative effect sizes indicate lower AVH scores in the intervention group compared to the control group.
Cognitive behavioural therapy
Eight studies tested the use of CBT and had sample sizes ranging from 26 to 116, with a median of 35 [Reference Sevi, Tekinsav Sutcu, Yesilyurt, Turan Eroglu and Gunes17, Reference Krakvik, Grawe, Hagen and Stiles58–Reference Naeem, Saeed, Irfan, Kiran, Mehmood and Gul61, Reference Naeem, Johal, McKenna, Rathod, Ayub and Lecomte67–Reference Husain, Chaudhry, Mehmood, Rehman, Kazmi and Hamirani70, Reference Hazell, Hayward, Cavanagh, Jones and Strauss76]. AVH scoring systems varied between studies, including both HPSVQ and PSYRATS AH. Studies were carried out across six different countries including Pakistan, the UK, Canada, the USA, Norway, and Turkey. CBT showed a medium and statistically significant effect size of −0.396 (95% CI, [−0.764, −0.027], Z = -2.103, p = 0.035) (Figure 5). Effect sizes were highly inconsistent across studies, showing moderate and statistically significant heterogeneity (Q = 19.624, I 2 = 64%, p = 0.006).

Figure 5. Forest plot of standardized mean differences (SMD) between post-intervention AVH scores in those who received CBT for AVHs, versus the control group. Negative effect sizes indicate lower AVH scores in the intervention group compared to the control group.
Repetitive transcranial magnetic stimulation
Nine studies tested the use of rTMS and had sample sizes ranging from 27 to 64, with a median of 36 [Reference Hua, Wang, He, Sun, Xu and Zhang20, Reference Xie, Guan, He, Wang, Ma and Fang32, Reference Klirova, Horacek, Novak, Cermak, Spaniel and Skrdlantova57, Reference Bais, Vercammen, Stewart, van Es, Visser and Aleman59, Reference Ray, Sinha and Tikka62–Reference Koops, van Dellen, Schutte, Nieuwdorp, Neggers and Sommer64, Reference Paillère-Martinot, Galinowski, Plaze, Andoh, Bartrés-Faz and Bellivier71, Reference Tyagi, Dhyani, Khattri, Tejan, Tikka and Garg89]. All studies used the AHRS and were carried out across six different countries: the Czech Republic, the Netherlands, France, India, Japan, and China. rTMS showed a small and not statistically significant effect size of −0.267 (95% CI, [−0.788 to 0.255], Z = −1.003, p = 0.316) (Figure 6). Effect sizes were highly inconsistent across studies, showing substantial and statistically significant heterogeneity (Q = 50.842, I 2 = 84%, p < 0.001).

Figure 6. Forest plot of standardized mean differences (SMD) between post-intervention AVH scores in those who received rTMS for AVHs (A), versus the control group (B). Negative effect sizes indicate lower AVH scores in the intervention group compared to the control group.
tDCS
Eleven studies tested the use of tDCS and had sample sizes ranging from 7 to 50, with a median of 13 [Reference Fröhlich, Burrello, Mellin, Cordle, Lustenberger and Gilmore21, Reference Fitzgerald, McQueen, Daskalakis and Hoy55, Reference Mondino, Jardri, Suaud-Chagny, Saoud, Poulet and Brunelin66, Reference Bose, Shivakumar, Agarwal, Kalmady, Shenoy and Sreeraj73, Reference Chang, Tzeng, Chao, Yeh and Chang74, Reference Koops, Blom, Bouachmir, Slot, Neggers and Sommer77–Reference Lindenmayer, Kulsa, Sultana, Kaur, Yang and Ljuri80, Reference LdCL, Goerigk, Gordon, Padberg, Serpa and Koebe83, Reference Klein, Vanneste and Pinkham86]. Scoring systems used varied between studies, including both the AHRS and PANSS AH. Studies were carried out across eight different countries, including: USA, France, India, Taiwan, the Netherlands, Brazil, Ireland, and Canada. tDCS showed a very small and not statistically significant effect size of −0.141(95% CI, [−0.367, 0.086], Z = −1.085, p = 0.278) (Figure 7). Effect sizes were inconsistent across studies, showing moderate, but not statistically significant heterogeneity (Q = 16.105, I 2 = 34%, p = 0.097).

Figure 7. Forest plot of standardized mean differences (SMD) between post-intervention AVH scores in those who received tDCS for AVHs (A), versus the control group (B). Negative effect sizes indicate lower AVH scores in the intervention group compared to the control group.
Transcranial alternating current stimulation
The study by Zhang et al. [Reference Zhang, Force, Walker, Ahn, Jarskog and Frohlich90] reported a medium-sized, non-significant effect, favouring those in the control group (Hedge’s g = 0.569 (95% CI, [−0.510, 1.649], p = 0.301, Z = 1.034).
Smartphone app
Two studies trialled a smartphone app intervention. The study by Bell et al. [Reference Bell, Rossell, Farhall, Hayward, Lim and Fielding-Smith81] trialled blended coping-focused therapy using ecological momentary assessment and intervention via a smartphone app. They found a medium, but not statistically significant effect size favouring the intervention (Hedge’s g = −0.558, 95% CI [−1.22, 0.127], p = 0.11, z = -1.597). The study by Jongeneel et al. [Reference Jongeneel, Delespaul, Tromp, Scheffers, van der Vleugel and de Bont91] trialled the use of the smartphone app, Temstem, which consists of two language games: “Silencing” to improve control over voices, and “Challenging” to reduce vividness and emotionality of voices. No significant difference was observed between those using the two language games on the Temstem app, and those in the control group (Hedge’s g = −0.092, 95% CI [−0.507, 0.324], p = 0.666, z = −0.432).
Psychological online intervention
Lüdtke et al. [Reference Lüdtke, Platow-Kohlschein, Rüegg, Berger, Moritz and Westermann82] found that partaking in a POI resulted in no significant difference in AVHs compared to the control group (Hedge’s g = −0.089, 95% CI [−0.672, 0.493], p = 0.763, z = −0.301).
Acceptance and commitment therapy
Two studies trialled ACT. Shawyer et al. [Reference Shawyer, Farhall, Thomas, Hayes, Gallop and Copolov18] found that ACT demonstrated no significant difference in AVHs compared to the control group (Hedge’s g = 0.013, 95% CI [−3.87, 0.413], p = 0.949, z = 0.063). In contrast, El Ashry et al. [Reference El Ashry, Abd El Dayem and Ramadan85] found that ACT demonstrated a very large and statistically significant effect size, favouring the intervention (Hedge’s g = −2.891, 95% CI [−3.595, −2.189], p < 0.001, z = −8.07).
Music therapy
Rast tonality MT, explored by Pinar et al., [Reference Pinar and Tel22] found a large, statistically significant effect size favouring the control group, who did not listen to any music whilst in hospital (Hedge’s g = 0.972, 95% CI [0.189, 1.756], p = 0.015, z = 2.433).
Relating therapy
Hayward et al. [Reference Hayward, Jones, Bogen-Johnston, Thomas and Strauss69] found that RT showed a medium, but not statistically significant effect size, favouring the intervention (Hedge’s g = −0.466, 95% CI [−1.245, 0.313], p = 0.241, z = −1.172).
Counselling
Schnackenberg et al. [Reference Schnackenberg, Fleming and Martin72] found that experienced focused counselling, had a very small and non-significant effect size, favouring the intervention (Hedge’s g = −0.092, 95% CI [−0.507, 0.896], p = 0.662, z = 0.437).
Rehabilitation (attention training programme)
The study by López-Luengo et al. [Reference Lopez-Luengo and Muela-Martinez65] found that rehabilitation in the form of an attention training programme produced a large, but not statistically significant effect size, favouring the intervention (Hedge’s g = −0.818, 95% CI [−3.593, −2.189], p = 0.116, z = −1.571).
Referral to community psychologist
Solar et al. [Reference Solar, Bennett and Hulse88] found that referring participants to a community psychologist resulted in a medium and non-significant effect size, favouring the control, which involved TAU (standard hospital care, with no direct psychologist appointment, but could self-request a referral) (Hedge’s g = 0.376, 95% CI [−0.598, 1.350], p = 0.449, z = 0.757).
Group mindfulness-based intervention
Chadwick et al. [Reference Chadwick, Strauss, Jones, Kingdon, Ellett and Dannahy23] found that receiving a group mindfulness-based intervention resulted in no significant difference in AVHs compared to the control group (Hedge’s g = −0.065, 95% CI [−0.442, 0.312], p = 0.736, z = −0.337).
Discussion
Summary of findings
This systematic review and meta-analysis summarised evidence from the last decade on fourteen different non-pharmacological interventions for schizophrenia and related disorders, through an evaluation of RCT studies. AVATAR therapy demonstrated the most favourable results for treating AVHs and was highly consistent across all studies. The larger sample sizes and geographical spread of RCTs exploring AVATAR therapy increase the generalisability of results. Overall, AVATAR therapy stands out as a promising candidate for future clinical implementation for treating AVHs; however, further replication across diverse settings and populations is needed before widespread adoption. CBT had promising, and statistically significant results of moderate effect size for improving AVHs; however, substantial heterogeneity was observed between studies. This may be due to the differences in methodologies employed in different trials, such as differences in CBT techniques, treatment duration, and therapist expertise. Non-invasive brain stimulation techniques, including rTMS, tDCS, and tACS did not yield statistically significant results in this meta-analysis; however, this may reflect variabilities in protocols, insufficient power, or other study-level factors rather than a definitive lack of clinical effect. However, rTMS had large variability in results across different studies, suggesting that differences in methodologies, potentially concerning anatomical location, and exact rTMS protocol, along with possible differences in patient populations, may be impacting the results. In contrast tDCS studies showed no significant heterogeneity, with more studies consistently reporting non-significant results, suggesting that either tDCS may not be clinically effective for AVHs, or that current studies lack sufficient power or methodological consistency to detect an effect. This conclusion is reinforced by the fact that 11 studies, conducted across the globe, were included in the subgroup analysis, increasing the diversity of samples and, therefore generalisability of the results. ACT demonstrated variable effectiveness at treating AVHs across the two studies. However, despite limited evidence for ACT, the study by El Ashry A. et al. [Reference El Ashry, Abd El Dayem and Ramadan85] had a moderate sample size, a very high QualSyst bias assessment score suggestive of low risk of bias, and was the only study carried out in Africa, a continent frequently underrepresented in research. As therapy is culturally influenced [Reference Sue and Zane92], demonstrating this effect in a non-Western setting could have great implications for effectively treating AVHs in these contexts. However, the contrasting results found by Shawyer et al. [Reference Shawyer, Farhall, Thomas, Hayes, Gallop and Copolov18] demonstrate a need for more trials refining the most effective method for delivering ACT. Rehabilitation, counselling, RT, and referral to a community psychologist all showed improvement in AVHs post-treatment, but these changes did not reach statistical significance, potentially due to limited power, sample size, or variability in study design. Receiving a group mindfulness-based intervention did not show statistical improvement in reducing AVHs. Similarly, smartphone apps and POI also did not yield statistically significant improvements in AVHs, potentially reflecting delivery challenges, suggesting that online platforms may benefit from future refinements before they can be considered for clinical applications. Lastly, MT demonstrated significantly better AVH outcomes in the control group post-treatment, who did not listen to any music during their hospital stay. However, at follow up, there was an improvement in AVHs in those who received MT, highlighting the value of long-term and follow-up RCTs for the treatment of AVHs.
Future research directions and clinical implications
By comparing these findings to existing systematic reviews and meta-analyses, we can assess the consistency, divergence, and potential advancements in understanding the efficacy of treatments for AVHs in schizophrenia spectrum disorders. A previous systematic review by Dellazizzo et al. [Reference Dellazizzo, Giguère, Léveillé, Potvin and Dumais25] explored relational-based therapies for AVHs, including both AVATAR therapy and CBT. Whilst CBT focuses on changing the person’s cognition surrounding their AVHs by, for instance, altering their belief that they must obey their AVHs, AVATAR therapy works on improving the relationship between the person and their AVHs, aiming to give the patient a sense of control. The present meta-analysis found that AVATAR therapy had a moderate size and statistically significant effect favouring the intervention. Dellazizzo et al.’s reported large or very large effect sizes for AVATAR therapy. The difference in effect size may be partially explained by the fact that the previous review included studies ranging from very low to high risk of bias, whereas only studies with low risk of bias were included in the present review. The reliability of our findings is reinforced by the low heterogeneity for AVATAR therapy (I 2 = 0.00%), making the conclusion valuable for future evidence-based clinical practice. However, it is important to note that three of the five included studies were conducted in the UK, with one in Canada and one in China, suggesting a need for greater geographic and cultural diversity in future trials. Continued research is needed to replicate these findings across more diverse populations and healthcare contexts before widespread clinical adoption can be recommended. Future RCTs should therefore not only evaluate AVATAR therapy’s efficacy but also assess its adaptability across different cultural settings to inform its potential integration into clinical practice.
According to the National Institute for Health and Care Excellence (NICE), CBT is currently a recognised treatment for schizophrenia and psychosis and is included in the NICE guidelines as a best-practice treatment recommendation [93]. However, evidence from Dellazizzo et al.’s [Reference Dellazizzo, Giguère, Léveillé, Potvin and Dumais25] systematic review reports that across the five studies included, all but one case series failed to show any significant effects of CBT in reducing PSYRATS AH scores. Similarly, the high heterogeneity across the eight studies included in this meta-analysis (I 2 = 64%) warrants cautious interpretation of the moderate and statistically significant effect size found in favour of CBT. This prompts an exploration of the different methodologies and samples used in trials for CBT. More high-quality research following the methodology of successful trials is needed to increase the reliability of results and to refine the most effective way to deliver CBT. Given the heterogeneity of the findings and mixed statistical significance, further targeted research is needed to clarify for whom and under what conditions CBT is most clinically effective for AVHs. Therefore, guidelines should be revised to specify which patient groups may benefit from CBT, as it is currently a general recommendation for those with schizophrenia or psychosis [93]. However, if AVHs are a main, distressing, and debilitating symptom for an individual, CBT may not be the appropriate treatment for them, and guidelines should reflect symptomology rather than general diagnoses, particularly in the context of schizophrenia spectrum disorders. Additionally, alternative therapies should be considered for best-practice guidelines, such as AVATAR therapy which has shown greater evidence for effectiveness against AVHs compared to CBT.
ACT is another variant of CBT, which focuses on helping individuals to reduce the distress caused by voices. A systematic review by Yıldız [Reference Yıldız94] included three studies which explored AVH outcomes post ACT. The study found that two out of the three studies reported significant improvements in AVHs after receiving ACT. This variability in effectiveness is also present in our results. Given the limited number of RCTs for ACT, this intervention remains under explored and future trials must be conducted to draw reliable conclusions.
With regards to non-invasive brain stimulation techniques, a systematic review and meta-analysis of ten studies by Otani et al. [Reference Otani, Shiozawa, Cordeiro and Uchida26], exploring the use of rTMS for treatment of auditory hallucinations in refractory schizophrenia, found a moderate, statistically significant effect size favouring rTMS (Hedge’s g = 0.49, 95% CI [0.11, 0.88], p = 0.011). Conversely, we found a small, not statistically significant effect size (Hedge’s g = −0.267, [95% CI, −0.788, 0.255], p = 0.316), in favour of rTMS. Multiple factors may have contributed to this difference. First, it is important to note that there was no crossover in the papers included in this meta-analysis and Otani et al.’s meta-analysis, which focused on papers from the years 2000 to 2011. Additionally, Otani et al. included studies that used low-frequency rTMS and rTMS applied only on the left temporoparietal cortex (leTPC) [Reference Otani, Shiozawa, Cordeiro and Uchida26]. Studies in the current meta-analysis included rTMS applied at any anatomical location, including the leTPC, left temporo-parietal junction (TPJ), right TPJ, and personalised localisation of the language perception area. Additionally, our study included high-frequency stimulation, continuous theta-burst stimulation, and low frequency stimulation. These differences may indicate that low-frequency rTMS applied exclusively to the leTPC may be more effective in treating AVHs, highlighting a need for more trials exploring this, and an up-to-date meta-analysis on studies with this inclusion criterion. However, the meta-analysis by Otani et al. did not evaluate each paper for risk of bias prior to inclusion, which may have also affected the results.
Furthermore, a previous systematic review and meta-analysis of 16 studies exploring tDCS by Cheng et al. [Reference Cheng, Louie, Wong, Wong, Leung and Nitsche27] found a moderate and statistically significant effect size in favour of tDCS (SMD = 0.36, 95 % CI [0.02, 0.70]). Although tDCS is also a non-invasive brain stimulation technique, it utilises an electrical current applied across the patient’s scalp [Reference Giordano, Bikson, Kappenman, Clark, Coslett and Hamblin95], versus the use of pulsed magnetic fields in rTMS [Reference Fitzgerald, Daskalakis, Fitzgerald and Daskalakis96]. tDCS also has greater potential for use in clinical practice due to being more cost-effective and easily accessible than rTMS [Reference Gomes, Imtiaz, Groden and Zaidi97]. However, Cheng et al. observed substantial heterogeneity across the included studies (I 2 = 65%), and statistical significance did not survive subsequent sensitivity analyses. Similarly, our meta-analysis found a small and not statistically significant effect size in favour of tDCS (−0.141, 95% CI, [−0.367 to 0.086]), with no significant heterogeneity, reinforcing the reliability of the negative result. Another systematic review of nine studies exploring tDCS by Rashidi et al. found that tDCS was no better than sham tDCS in reducing AVH in 3 studies, and that tDCS had a significant effect in reducing AVHs in the remaining six studies [Reference Rashidi, Jones, Murillo-Rodriguez, Machado, Hao and Yadollahpour28]. The variability in results across systematic reviews suggests that more high-quality RCTs exploring tDCS may be needed to confirm effectiveness.
Strengths and limitations
This systematic review and meta-analysis has a number of strengths. First, the inclusion of a large sample of 45 studies with a total of 2,314 participants increases the statistical power of the meta-analysis and provides a comprehensive summary of evidence from the last decade. Second, PRISMA guidelines were followed, and a robust bias assessment was carried out on each included study, along with an assessment of a funnel plot for publication bias. The funnel plot did not demonstrate obvious publication bias, suggesting that a range of studies with both positive and negative findings were published [Reference Egger, Smith, Schneider and Minder98]. This increases the reliability of our results. Since only high-quality evidence was included, results can be used to influence future research priorities, with the potential to influence clinical practice and healthcare policy. Furthermore, some subgroup analyses had low heterogeneity, particularly for AVATAR therapy. This enables a definite interpretation of effect size [Reference Higgins, Thompson, Deeks and Altman46], which can provide a clear guide for future research plans and clinical applications. The strengths of the present study make it a valuable addition to the literature, including previous scoping reviews on non-pharmacological treatments of AVHs [Reference Smith, Mateos, Due, Bergström, Nordentoft and Clemmensen99].
This systematic review and meta-analysis also has some limitations, which must be addressed. First, 11 eligible studies had to be excluded for missing data regarding pre- and post-treatment AVH scores, despite contacting authors with data requests. Missing data can introduce bias, as negative or non-significant results may have been omitted, leading to a potential overestimation of effect sizes [Reference Little and Rubin100]. Second, while some studies reported follow-up outcomes, this study did not. Consequently, it was not possible to assess the long-term treatment effects; an essential consideration for clinical applicability, as therapeutic impacts may wane, persist, or intensify over time [Reference Frey and Rogers101]. The absence of long-term data limits the strength of conclusions regarding sustained efficacy and patient outcomes. Future RCTs should incorporate follow-up assessments, and meta-analyses would benefit from stratifying and rigorously evaluating longitudinal data. It is important to acknowledge that, due to inconsistent reporting of change scores and associated standard deviations across studies, treatment effects were derived from post-treatment AVH scores. While baseline AVH scores were similar between groups in the included studies, using post-treatment scores rather than change scores may reduce sensitivity to within-group effects and should be considered when interpreting the results. Additionally, there was variability and inconsistency in the assessment tools used to measure AVHs across studies. While all included studies measured at least one core domain of AVHs, such as frequency, intensity, or distress, they employed different psychometric instruments and scales. This variability introduces measurement heterogeneity, which may have complicated data synthesis and interpretation. To account for this, standardised effect sizes were used in the meta-analysis; however, caution is warranted when comparing across studies due to differences in measurement approaches. Furthermore, some subgroups showed substantial and significant heterogeneity in their results, particularly for rTMS and CBT. This makes findings harder to interpret, as high heterogeneity may weaken the generalisability and reliability of pooled effect sizes. For example, findings may reflect study specific factors, rather than true treatment efficacy [Reference Imrey102]. Importantly, lack of statistical significance in some findings should not be interpreted as evidence of no clinical effect. Many interventions may show real clinical benefit that is not detectable due to small sample sizes, methodological limitations, or inadequate statistical power. Future high-quality studies are needed to clarify these effects.
Furthermore, all included studies evaluated the interventions as adjuncts to existing treatment regimens, making it impossible to isolate their specific effects on AVHs. This limitation compromises internal validity, but may enhance external validity by better reflecting how such interventions are implemented in real-world clinical and community settings [Reference Slack and Draugalis103]. The inclusion of some participants who did not have SSDs, but instead had mood or personality disorders with AVHs, may limit how reliably conclusions can be drawn about the treatment of AVHs in SSDs specifically. Future meta-analyses should adopt rigid inclusion criteria for diagnoses. Nonetheless, the present meta-analysis included such studies to ensure a comprehensive overview of all recent RCTs for treatments of AVHs, where the majority of participants had SSDs. Additionally, studies support that AVHs may be phenomenologically similar across these conditions [Reference Slotema, Daalman, Blom, Diederen, Hoek and Sommer104, Reference Toh, Thomas, Hollander and Rossell105] and respond similarly to established treatments such as antipsychotics [Reference Slotema, Daalman, Blom, Diederen, Hoek and Sommer104]. Furthermore, not all studies had a placebo, inactive, or TAU control group. For instance, the study exploring AVATAR therapy by Liang et al. [Reference Liang, Li, Guo, Liu, Liu and Zhao87] utilised an active control group which received CBT, on top of TAU. This makes the true effect of the intervention unclear, as the two interventions were compared head-to-head.
Despite incorporating a wide range of keywords related to emerging non-pharmacological treatments, such as virtual reality therapy, digital interventions (eHealth, mHealth), and psychedelic-assisted psychotherapy, few or no eligible RCTs were identified for many of these approaches. This underscores the need for rigorous, well-powered RCTs to evaluate the efficacy of these novel interventions for AVHs. Addressing these underrepresented domains through targeted investigation may enrich evidence-informed practice and enhance the diversity of therapeutic avenues available to mental health professionals.
Conclusion
This systematic review and meta-analysis highlight that AVATAR therapy has the strongest base of evidence for treating AVHs in SSDs. It is of clinical priority to continue large-scale RCTs for AVATAR therapy, including studies across culturally and geographically diverse populations, to confirm its effectiveness and generalisability. Conversely, CBT requires standardisation to improve reliability with regards to its effectiveness in treating AVHs. This finding supports the ongoing dialogue around the refinement of clinical guidelines, such as those proposed by NICE, towards more symptom-oriented recommendations for SSDs. Although ACT has shown potential in alleviating AVH-related distress, its application in clinical settings warrants additional empirical validation and methodological refinement. Moreover, there remains a clear need for geographically diverse RCTs, including the expansion of studies like El Ashry et al.’s [Reference El Ashry, Abd El Dayem and Ramadan85] that examine ACT in non-Western populations. Finally, while non-invasive brain stimulation techniques appear promising, current evidence is insufficient to justify their integration into clinical practice.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1192/j.eurpsy.2025.10115.
Data availability statement
Data availability is not applicable to this article as no new data were created or analysed in this study.
Author contribution
MC and NS conceived the study, carried out the data collection, bias analysis, statistical analyses and interpretation. MC wrote the first draft, and MC and NS revised and approved the final draft.
Financial support
This work was funded by Brighton and Sussex Medical School. The funding source did not have any role in determining the content of the manuscript.
Competing interests
None declared.


Comments
No Comments have been published for this article.