Research Synthesis Methods: Volume 16 - Issue 6

Meta-analyzing correlation matrices in the presence of hierarchical effect size multiplicity
Ronny Scherer, Diego G. Campos
Published online by Cambridge University Press:

07 August 2025, pp. 828-858
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
To synthesize evidence on the relations among multiple constructs, measures, or concepts, meta-analyzing correlation matrices across primary studies has become a crucial analytic approach. Common meta-analytic approaches employ univariate or multivariate models to estimate a pooled correlation matrix, which is subjected to further analyses, such as structural equation modeling. In practice, meta-analysts often extract multiple correlation matrices per study from various samples, study sites, labs, or countries, thus introducing hierarchical effect size multiplicity into the meta-analytic data. However, this feature has largely been ignored when pooling correlation matrices for meta-analysis. To contribute to the methodological development in this area, we describe a multilevel, multivariate, and random-effects modeling approach, which pools correlation matrices meta-analytically and, at the same time, addresses hierarchical effect size multiplicity. Specifically, it allows meta-analysts to test various assumptions on the dependencies among random effects, aiding the selection of a meta-analytic baseline model. We describe this approach, present four working models within it, and illustrate them with an example and the corresponding R code.

Optimal large language models to screen citations for systematic reviews
Takehiko Oami, Yohei Okada, Taka-aki Nakada
Published online by Cambridge University Press:

23 June 2025, pp. 859-875
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
Recent studies highlight the potential of large language models (LLMs) in citation screening for systematic reviews; however, the efficiency of individual LLMs for this application remains unclear. This study aimed to compare accuracy, time-related efficiency, cost, and consistency across four LLMs—GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.3 70B—for literature screening tasks. The models screened for clinical questions from the Japanese Clinical Practice Guidelines for the Management of Sepsis and Septic Shock 2024. Sensitivity and specificity were calculated for each model based on conventional citation screening results for qualitative assessment. We also recorded the time and cost of screening and assessed consistency to verify reproducibility. A post hoc analysis explored whether integrating outputs from multiple models could enhance screening accuracy. GPT-4o and Llama 3.3 70B achieved high specificity but lower sensitivity, while Gemini 1.5 Pro and Claude 3.5 Sonnet exhibited higher sensitivity at the cost of lower specificity. Citation screening times and costs varied, with GPT-4o being the fastest and Llama 3.3 70B the most cost-effective. Consistency was comparable among the models. An ensemble approach combining model outputs improved sensitivity but increased the number of false positives, requiring additional review effort. Each model demonstrated distinct strengths, effectively streamlining citation screening by saving time and reducing workload. However, reviewing false positives remains a challenge. Combining models may enhance sensitivity, indicating the potential of LLMs to optimize systematic review workflows.

Knowledge user involvement is still uncommon in published rapid reviews—a meta-research cross-sectional study
Barbara Nussbaumer-Streit, Dominic Ledinger, Christina Kien, Irma Klerings, Emma Persad, Andrea Chapman, Claus Nowak, Arianna Gadinger, Lisa Affengruber, Maureen Smith, Gerald Gartlehner, Ursula Griebler
Published online by Cambridge University Press:

10 July 2025, pp. 876-899
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
Background
Involving knowledge users (KUs) such as patients, clinicians, or health policymakers is particularly relevant when conducting rapid reviews (RRs), as they should be tailored to decision-makers’ needs. However, little is known about how common KU involvement currently is in RRs.
Objectives
We wanted to assess the proportion of KU involvement reported in recently published RRs (2021 onwards), which groups of KUs were involved in each phase of the RR process, to what extent, and which factors were associated with KU involvement in RRs.
Methods
We conducted a meta-research cross-sectional study. A systematic literature search in Ovid MEDLINE and Epistemonikos in January 2024 identified 2,493 unique records. We dually screened the identified records (partly with assistance from an artificial intelligence (AI)-based application) until we reached the a priori calculated sample size of 104 RRs. We dually extracted data and analyzed it descriptively.
Results
The proportion of RRs that reported KU involvement was 19% (95% confidence interval [CI]: 12%–28%). Most often, KUs were involved during the initial preparation of the RR, the systematic searches, and the interpretation and dissemination of results. Researchers/content experts and public/patient partners were the KU groups most often involved. KU involvement was more common in RRs focusing on patient involvement/shared decision-making, having a published protocol, and being commissioned.
Conclusions
Reporting KU involvement in published RRs is uncommon and often vague. Future research should explore barriers and facilitators for KU involvement and its reporting in RRs. Guidance regarding reporting on KU involvement in RRs is needed.

Regression augmented weighting adjustment for indirect comparisons in health decision modelling
Chengyang Gao, Anna Heath, Gianluca Baio
Published online by Cambridge University Press:

10 July 2025, pp. 900-921
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
Background
Understanding the relative costs and effectiveness of all competing interventions is crucial to informing health resource allocations. However, to receive regulatory approval for efficacy, novel pharmaceuticals are typically only compared against placebo or standard of care. The relative efficacy against the best alternative intervention relies on indirect comparisons of different interventions. When treatment effect modifiers are distributed differently across trials, population adjustment is necessary to ensure a fair comparison. Matching-Adjusted Indirect Comparisons (MAIC) is the most widely adopted weighting-based method for this purpose. Nevertheless, MAIC can exhibit instability under poor population overlap. Regression-based approaches to overcome this issue are heavily dependent on parametric assumptions.
Methods
We introduce a novel method, ‘G-MAIC,’ which combines outcome regression and weighting-adjustment to address these limitations. Inspired by Bayesian survey inference, G-MAIC employs Bayesian bootstrap to propagate the uncertainty of population-adjusted estimates. We evaluate the performance of G-MAIC against standard non-adjusted methods, MAIC and Parametric G-computation, in a simulation study encompassing 18 scenarios with varying trial sample sizes, population overlaps, and covariate structures.
Results
Under poor overlap and small sample sizes, MAIC can produce non-sensible variance estimations or increased bias compared to non-adjusted methods, depending on covariate structures in the two trials compared. G-MAIC mitigates this issue, achieving comparable performance to parametric G-computation with reduced reliance on parametric assumptions.
Conclusion
G-MAIC presents a robust alternative to the widely adopted MAIC for population-adjusted indirect comparisons. The underlying framework is flexible such that it can accommodate advanced nonparametric outcome models and alternative weighting schemes.

Novel approaches for random-effects meta-analysis of a small number of studies under normality
Yajie Duan, Thomas Mathew, Demissie Alemayehu, Ge Cheng
Published online by Cambridge University Press:

10 July 2025, pp. 922-936
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
Random-effects meta-analyses with only a few studies often face challenges in accurately estimating between-study heterogeneity, leading to biased effect estimates and confidence intervals with poor coverage. This issue is especially the case when dealing with rare diseases. To address this problem for normally distributed outcomes, two new approaches have been proposed to provide confidence limits of the global mean: one based on fiducial inference, and the other involving two modifications of the signed log-likelihood ratio test statistic in order to have improved performance with small numbers of studies. The performance of the proposed methods was evaluated numerically and compared with the Hartung–Knapp–Sidik–Jonkman approach and its modification to handle small numbers of studies. The simulation results indicated that the proposed methods achieved coverage probabilities closer to the nominal level and produced shorter confidence intervals compared to those based on existing methods. Two real examples are used to illustrate the proposed methods.

Simple imputation method for meta-analysis of survival rates when precision information is missing
Kazushi Maruo, Yusuke Yamaguchi, Ryota Ishii, Hisashi Noma, Masahiko Gosho
Published online by Cambridge University Press:

11 September 2025, pp. 937-952
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
In meta-analyses of survival rates, precision information (i.e., standard errors (SEs) or confidence intervals) are often missing in clinical studies. In current practice, such studies are often excluded from the synthesis analyses. However, the naïve deletion of these incomplete data can produce serious biases and loss of precision in pooled estimators. To address these issues, we developed a simple but effective method to impute precision information using commonly available statistics from individual studies, such as sample size, number of events, and risk set size at a time point of interest. By applying this new method, we can effectively circumvent the deletion of incomplete data, resultant biases, and losses of precision. Based on extensive simulation studies, the developed method markedly improves the accuracy and precision of the pooled estimators compared to those of naïve analyses that delete studies with missing precision. Furthermore, the performance of the proposed method was not significantly inferior to the ideal case, where there was no missing precision information. However, for studies for which the risk set size at the time of interest was not available, the proposed method runs the risk of overestimating the SE. Although the proposed method is a single-imputation method, the simulations show that there is no underestimation bias of the SE, even though the proposed method does not consider the uncertainty of missing values. To demonstrate the robustness of our proposed methods, they were applied in a systematic review of radiotherapy data. An R package was developed to implement the proposed procedure.

Combining search filters for randomized controlled trials with the Cochrane RCT Classifier in Covidence: a methodological validation study
Klas Moberg, Carl Gornitzki
Published online by Cambridge University Press:

28 August 2025, pp. 953-960
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
Our objective was to evaluate the recall and number needed to read (NNR) for the Cochrane RCT Classifier compared to and in combination with established search filters developed for Ovid MEDLINE and Embase.com. A gold standard set of 1,103 randomized controlled trials (RCTs) was created to calculate recall for the Cochrane RCT Classifier in Covidence, the Cochrane sensitivity-maximizing RCT filter in Ovid MEDLINE and the Cochrane Embase RCT filter for Embase.com. In addition, the classifier and the filters were validated in three case studies using reports from the Swedish Agency for Health Technology Assessment and Assessment of Social Services to assess impact on search results and NNR. The Cochrane RCT Classifier had the highest recall with 99.64% followed by the Cochrane sensitivity-maximizing RCT filter in Ovid MEDLINE with 98.73% and the Cochrane Embase RCT filter with 98.46%. However, the Cochrane RCT Classifier had a higher NNR than the RCT filters in all case studies. Combining the RCT filters with the Cochrane RCT Classifier reduced NNR compared to using the RCT filters alone while achieving a recall of 98.46% for the Ovid MEDLINE/RCT Classifier combination and 98.28% for the Embase/RCT Classifier combination. In conclusion, we found that the Cochrane RCT Classifier in Covidence has a higher recall than established search filters but also a higher NNR. Thus, using the Cochrane RCT Classifier instead of current state-of-the-art RCT filters would lead to an increased workload in the screening process. A viable option with a lower NNR than RCT filters, at the cost of a slight decrease in recall, is to combine the Cochrane RCT Classifier with RCT filters in database searches.

Trials and triangles: Network meta-analysis of multi-arm trials with correlated arms
Gerta Rücker, Guido Schwarzer
Published online by Cambridge University Press:

01 August 2025, pp. 961-974
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
For network meta-analysis (NMA), we usually assume that the treatment arms are independent within each included trial. This assumption is justified for parallel design trials and leads to a property we call consistency of variances for both multi-arm trials and NMA estimates. However, the assumption is violated for trials with correlated arms, for example, split-body trials. For multi-arm trials with correlated arms, the variance of a contrast is not the sum of the arm-based variances, but comes with a correlation term. This may lead to violations of variance consistency, and the inconsistency of variances may even propagate to the NMA estimates. We explain this using a geometric analogy where three-arm trials correspond to triangles and four-arm trials correspond to tetrahedrons. We also investigate which information has to be extracted for a multi-arm trial with correlated arms and provide an algorithm to analyze NMAs including such trials.

Evaluation of semi-automated record screening methods for systematic reviews of prognosis studies and intervention studies
Isa Spiero, Artuur M. Leeuwenberg, Karel G. M. Moons, Lotty Hooft, Johanna A. A. Damen
Published online by Cambridge University Press:

22 July 2025, pp. 975-989
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
Systematic reviews (SRs) synthesize evidence through a rigorous, labor-intensive, and costly process. To accelerate the title–abstract screening phase of SRs, several artificial intelligence (AI)-based semi-automated screening tools have been developed to reduce workload by prioritizing relevant records. However, their performance is primarily evaluated for SRs of intervention studies, which generally have well-structured abstracts. Here, we evaluate whether screening tool performance is equally effective for SRs of prognosis studies that have larger heterogeneity between abstracts. We conducted retrospective simulations on prognosis and intervention reviews using a screening tool (ASReview). We also evaluated the effects of review scope (i.e., breadth of the research question), number of (relevant) records, and modeling methods within the tool. Performance was assessed in terms of recall (i.e., sensitivity), precision at 95% recall (i.e., positive predictive value at 95% recall), and workload reduction (work saved over sampling at 95% recall [WSS@95%]). The WSS@95% was slightly worse for prognosis reviews (range: 0.324–0.597) than for intervention reviews (range: 0.613–0.895). The precision was higher for prognosis (range: 0.115–0.400) compared to intervention reviews (range: 0.024–0.057). These differences were primarily due to the larger number of relevant records in the prognosis reviews. The modeling methods and the scope of the prognosis review did not significantly impact tool performance. We conclude that the larger abstract heterogeneity of prognosis studies does not substantially affect the effectiveness of screening tools for SRs of prognosis. Further evaluation studies including a standardized evaluation framework are needed to enable prospective decisions on the reliable use of screening tools.

Assessing risk of bias of cohort studies with large language models
Danni Xia, Honghao Lai, Weilong Zhao, Jiajie Huang, Jiayi Liu, Ziying Ye, Jianing Liu, Mingyao Sun, Liangying Hou, Bei Pan, Long Ge
Published online by Cambridge University Press:

07 August 2025, pp. 990-1004
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
This study aims to explore the feasibility and accuracy of utilizing large language models (LLMs) to assess the risk of bias (ROB) in cohort studies. We conducted a pilot and feasibility study in 30 cohort studies randomly selected from reference lists of published Cochrane reviews. We developed a structured prompt to guide the ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3 to assess the ROB of each cohort twice. We used the ROB results assessed by three evidence-based medicine experts as the gold standard, and then we evaluated the accuracy of LLMs by calculating the correct assessment rate, sensitivity, specificity, and F1 scores for overall and item-specific levels. The consistency of the overall and item-specific assessment results was evaluated using Cohen’s kappa (κ) and prevalence-adjusted bias-adjusted kappa. Efficiency was estimated by the mean assessment time required. This study assessed three LLMs (ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3) and revealed distinct performance across eight assessment items. Overall accuracy was comparable (80.8%–83.3%). Moonshot-v1-128k showed superior sensitivity in population selection (0.92 versus ChatGPT-4o’s 0.55, P < 0.001). In terms of F1 scores, Moonshot-v1-128k led in population selection (F = 0.80 versus ChatGPT-4o’s 0.67, P = 0.004). ChatGPT-4o demonstrated the highest consistency (mean κ = 96.5%), with perfect agreement (100%) in outcome confidence. ChatGPT-4o was 97.3% faster per article (32.8 seconds versus 20 minutes manually) and outperformed Moonshot-v1-128k and DeepSeek-V3 by 47–50% in processing speed. The efficient and accurate assessment of ROB in cohort studies by ChatGPT-4o, Moonshot-v1-128k, and DeepSeek-V3 highlights the potential of LLMs to enhance the systematic review process.

StudyTypeTeller—Large language models to automatically classify research study types for systematic reviews
Simona Emilova Doneva, Shirin de Viragh, Hanna Hubarava, Stefan Schandelmaier, Matthias Briel, Benjamin Victor Ineichen
Published online by Cambridge University Press:

11 September 2025, pp. 1005-1024
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
Abstract screening, a labor-intensive aspect of systematic review, is increasingly challenging due to the rising volume of scientific publications. Recent advances suggest that generative large language models like generative pre-trained transformer (GPT) could aid this process by classifying references into study types such as randomized-controlled trials (RCTs) or animal studies prior to abstract screening. However, it is unknown how these GPT models perform in classifying such scientific study types in the biomedical field. Additionally, their performance has not been directly compared with earlier transformer-based models like bidirectional encoder representations from transformers (BERT). To address this, we developed a human-annotated corpus of 2,645 PubMed titles and abstracts, annotated for 14 study types, including different types of RCTs and animal studies, systematic reviews, study protocols, case reports, as well as in vitro studies. Using this corpus, we compared the performance of GPT-3.5 and GPT-4 in automatically classifying these study types against established BERT models. Our results show that fine-tuned pretrained BERT models consistently outperformed GPT models, achieving F1-scores above 0.8, compared to approximately 0.6 for GPT models. Advanced prompting strategies did not substantially boost GPT performance. In conclusion, these findings highlight that, even though GPT models benefit from advanced capabilities and extensive training data, their performance in niche tasks like scientific multi-class study classification is inferior to smaller fine-tuned models. Nevertheless, the use of automated methods remains promising for reducing the volume of records, making the screening of large reference libraries more feasible. Our corpus is openly available and can be used to harness other natural language processing (NLP) approaches.

What happens to qualitative studies initially presented as conference abstracts: A survey among study authors
Marwin Weber, Simon Lewin, Joerg J. Meerpohl, Heather Menzies Munthe-Kaas, Rigmor Berg, Andrew Booth, Claire Glenton, Jane Noyes, Ingrid Toews
Published online by Cambridge University Press:

05 September 2025, pp. 1025-1034
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
Qualitative research addresses important healthcare questions, including patients’ experiences with interventions. Qualitative evidence syntheses combine findings from individual studies and are increasingly used to inform health guidelines. However, dissemination bias—selective non-dissemination of studies or findings—may distort the body of evidence. This study examined reasons for the non-dissemination of qualitative studies. We identified conference abstracts reporting qualitative, health-related studies. We invited authors to answer a survey containing quantitative and qualitative questions. We performed descriptive analyses on the quantitative data and inductive thematic analysis on the qualitative data. Most of the 142 respondents were female, established researchers. About a third reported that their study had not been published in full after their conference presentation. The main reasons were time constraints, career changes, and a lack of interest. Few indicated non-publication due to the nature of the study findings. Decisions not to publish were largely made by author teams. Half of the 72% who published their study reported that all findings were included in the publication. This study highlights researchers’ reasons for non-dissemination of qualitative research. One-third of studies presented as conference abstracts remained unpublished, but non-dissemination was rarely linked to the study findings. Further research is needed to understand the systematic non-dissemination of qualitative studies.

Using large language models to directly screen electronic databases as an alternative to traditional search strategies such as the Cochrane highly sensitive search for filtering randomized controlled trials in systematic reviews
Viet-Thi Tran, Carolina Grana Possamai, Isabelle Boutron, Philippe Ravaud
Published online by Cambridge University Press:

10 October 2025, pp. 1035-1041
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
A critical step in systematic reviews involves the definition of a search strategy, with keywords and Boolean logic, to filter electronic databases. We hypothesize that it is possible to screen articles in electronic databases using large language models (LLMs) as an alternative to search equations. To investigate this matter, we compared two methods to identify randomized controlled trials (RCTs) in electronic databases: filtering databases using the Cochrane highly sensitive search and an assessment by an LLM.
We retrieved studies indexed in PubMed with a publication date between September 1 and September 30, 2024 using the sole keyword “diabetes.” We compared the performance of the Cochrane highly sensitive search and the assessment of all titles and abstracts extracted directly from the database by GPT-4o-mini to identify RCTs. Reference standard was the manual screening of retrieved articles by two independent reviewers.
The search retrieved 6377 records, of which 210 (3.5%) were primary reports of RCTs. The Cochrane highly sensitive search filtered 2197 records and missed one RCT (sensitivity 99.5%, 95% CI 97.4% to100%; specificity 67.8%, 95% CI 66.6% to 68.9%). Assessment of all titles and abstracts from the electronic database by GPT filtered 1080 records and included all 210 primary reports of RCTs (sensitivity 100%, 95% CI 98.3% to100%; specificity 85.9%, 95% CI 85.0% to 86.8%).
LLMs can screen all articles in electronic databases to identify RCTs as an alternative to the Cochrane highly sensitive search. This calls for the evaluation of LLMs as an alternative to rigid search strategies.

NMAsurv: An R Shiny application for network meta-analysis based on survival data
Taihang Shao, Mingye Zhao, Fenghao Shi, Mingjun Rui, Wenxi Tang
Published online by Cambridge University Press:

10 July 2025, pp. 1042-1056
- Article
- - You have access
  - Open access
- PDF
- HTML
- Export citation
Network meta-analysis (NMA) is becoming increasingly important, especially in the field of medicine, as it allows for comparisons across multiple trials with different interventions. For time-to-event data, that is, survival data, traditional NMA based on the proportional hazards (PH) assumption simply synthesizes reported hazard ratios (HRs). Novel methods for NMA based on the non-PH assumption have been proposed and implemented using R software. However, these methods often involve complex methodologies and require advanced programming skills, creating a barrier for many researchers. Therefore, we developed an R Shiny tool, NMAsurv (https://psurvivala.shinyapps.io/NMAsurv/). NMAsurv allows users with little or zero background in R to conduct survival-data-based NMA effortlessly. The tool supports various functions such as drawing network plots, testing the PH assumption, and building NMA models. Users can input either reconstructed pseudo-individual participant data or aggregated data. NMAsurv offers a user-friendly interface for extracting parameter estimations from various NMA models, including fractional polynomial, piecewise exponential models, parametric survival models, Cox PH model, and generalized gamma model. Additionally, it enables users to effortlessly create survival and HR plots. All operations can be performed by an intuitive “point-and-click” interface. In this study, we introduce all the functionalities and features of NMAsurv and demonstrate its application using a real-world NMA example.

Research Synthesis Methods

Refine listing

Actions for selected content:

Volume 16 - Issue 6 - November 2025

Tutorial

Meta-analyzing correlation matrices in the presence of hierarchical effect size multiplicity

Research Article

Optimal large language models to screen citations for systematic reviews

Knowledge user involvement is still uncommon in published rapid reviews—a meta-research cross-sectional study

Regression augmented weighting adjustment for indirect comparisons in health decision modelling

Novel approaches for random-effects meta-analysis of a small number of studies under normality

Simple imputation method for meta-analysis of survival rates when precision information is missing

Combining search filters for randomized controlled trials with the Cochrane RCT Classifier in Covidence: a methodological validation study

Trials and triangles: Network meta-analysis of multi-arm trials with correlated arms

Evaluation of semi-automated record screening methods for systematic reviews of prognosis studies and intervention studies

Assessing risk of bias of cohort studies with large language models

StudyTypeTeller—Large language models to automatically classify research study types for systematic reviews

What happens to qualitative studies initially presented as conference abstracts: A survey among study authors

Research-in-Brief

Using large language models to directly screen electronic databases as an alternative to traditional search strategies such as the Cochrane highly sensitive search for filtering randomized controlled trials in systematic reviews

Software Focus

NMAsurv: An R Shiny application for network meta-analysis based on survival data

Research Synthesis Methods

Refine listing

Actions for selected content:

Save Search

Volume 16 - Issue 6 - November 2025

Tutorial

Research Article

Research-in-Brief

Software Focus