Hostname: page-component-7dd5485656-frp75 Total loading time: 0 Render date: 2025-10-22T14:36:11.025Z Has data issue: false hasContentIssue false

Using large language models to directly screen electronic databases as an alternative to traditional search strategies such as the Cochrane highly sensitive search for filtering randomized controlled trials in systematic reviews

Published online by Cambridge University Press:  10 October 2025

Viet-Thi Tran*
Affiliation:
Center for Research in Epidemiology and Statistics (CRESS), Université Paris Cité and Université Sorbonne Paris Nord, INSERM, INRAE , Paris, France Centre d’Epidemiologie Clinique, AP-HP, Hôpital Hôtel Dieu, Paris, France
Carolina Grana Possamai
Affiliation:
Center for Research in Epidemiology and Statistics (CRESS), Université Paris Cité and Université Sorbonne Paris Nord, INSERM, INRAE , Paris, France Centre Cochrane France, Paris, France
Isabelle Boutron
Affiliation:
Center for Research in Epidemiology and Statistics (CRESS), Université Paris Cité and Université Sorbonne Paris Nord, INSERM, INRAE , Paris, France Centre d’Epidemiologie Clinique, AP-HP, Hôpital Hôtel Dieu, Paris, France Centre Cochrane France, Paris, France
Philippe Ravaud
Affiliation:
Center for Research in Epidemiology and Statistics (CRESS), Université Paris Cité and Université Sorbonne Paris Nord, INSERM, INRAE , Paris, France Centre d’Epidemiologie Clinique, AP-HP, Hôpital Hôtel Dieu, Paris, France Centre Cochrane France, Paris, France Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY, USA
*
Corresponding author: Viet-Thi Tran; Email: thi.tran-viet@aphp.fr
Rights & Permissions [Opens in a new window]

Abstract

A critical step in systematic reviews involves the definition of a search strategy, with keywords and Boolean logic, to filter electronic databases. We hypothesize that it is possible to screen articles in electronic databases using large language models (LLMs) as an alternative to search equations. To investigate this matter, we compared two methods to identify randomized controlled trials (RCTs) in electronic databases: filtering databases using the Cochrane highly sensitive search and an assessment by an LLM.

We retrieved studies indexed in PubMed with a publication date between September 1 and September 30, 2024 using the sole keyword “diabetes.” We compared the performance of the Cochrane highly sensitive search and the assessment of all titles and abstracts extracted directly from the database by GPT-4o-mini to identify RCTs. Reference standard was the manual screening of retrieved articles by two independent reviewers.

The search retrieved 6377 records, of which 210 (3.5%) were primary reports of RCTs. The Cochrane highly sensitive search filtered 2197 records and missed one RCT (sensitivity 99.5%, 95% CI 97.4% to100%; specificity 67.8%, 95% CI 66.6% to 68.9%). Assessment of all titles and abstracts from the electronic database by GPT filtered 1080 records and included all 210 primary reports of RCTs (sensitivity 100%, 95% CI 98.3% to100%; specificity 85.9%, 95% CI 85.0% to 86.8%).

LLMs can screen all articles in electronic databases to identify RCTs as an alternative to the Cochrane highly sensitive search. This calls for the evaluation of LLMs as an alternative to rigid search strategies.

Information

Type
Research-in-Brief
Creative Commons
Creative Common License - CCCreative Common License - BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Highlights

What is already known?

A critical step in systematic reviews involves the definition of a search strategy, with keywords and Boolean logic, to filter electronic databases.

What is new?

We hypothesize that LLMs could directly screen all articles in electronic databases as an alternative to rigid search equations. To investigate this matter, we compared two methods to identify RCTs in electronic databases: the Cochrane highly sensitive search and an assessment by a GPT (OpenAI) of all articles published. The Cochrane highly sensitive search missed one article and had poor specificity. In contrast, assessment of all titles and abstracts extracted from the electronic database by GPT yielded perfect sensibility and high specificity.

Potential impact for RSM readers

In contrast to current uses of LLMs to accelerate search in systematic reviews by mimicking and improving human tasks, we show that it is possible to envision the process of systematic reviews differently, accounting for the ability of LLMs to treat large volumes of data automatically and at low costs.

1 Background

Synthesizing evidence from randomized controlled trials (RCTs) in systematic reviews and meta-analyses is a cornerstone of evidence-based medicine. Yet, current methods for systematic reviews are slow and resource intensive, requiring up to a year for a team when following recommended approaches.Reference Allen and Olkin1, Reference Schmidt, Cree and Campbell2 The exponential growth of medical literature and the demand for timely evidence synthesis have thus driven the development of automated tools to accelerate the review process, from literature screening to data extraction and risk of bias assessment.Reference Lieberum, Töws, Metzendorf, Heilmeyer, Siemens, Haverkamp, Böhringer, Meerpohl and Eisele-Metzger3

One of the first steps of the review process involves the formulation of a search strategy with specific keywords and Boolean logic to filter electronic databases such as PubMed. Guidance documents for systematic reviews recommend combining broad set of search terms for each concept (e.g., population, intervention), with the “OR” operator within concepts and the “AND” operator between concepts.Reference Cooper, Booth, Varley-Campbell, Britten and Garside4 To identify studies with the appropriate design, the Cochrane Collaboration recommends adding the validated “highly sensitive search strategy” to identify RCTs in systematic reviews focused on interventional studies.Reference Robinson and Dickersin5, Reference Lefebvre, Glanville, Briscoe, Featherstone, Littlewood, Metzendorf, Noel-Storr, Paynter, Rader, Thomas and Wieland6 While such approach maximizes sensitivity, it also often results in the retrieval of a large number of irrelevant records, which must then be screened manually.

A recent scoping review of large language model (LLM) applications for evidence synthesis has found that 41% targeted the search strategy by assisting in refining Medical Subject Headings (MeSH) terms, formulating PubMed queries, or translating queries across databases.Reference Lieberum, Töws, Metzendorf, Heilmeyer, Siemens, Haverkamp, Böhringer, Meerpohl and Eisele-Metzger3 In contrast to these approaches, we hypothesize that LLMs’ ability to treat large volumes of data automatically could be used to completely bypass the search strategy step by directly screening all articles in electronic databases as an alternative to rigid search equations.Reference Tran, Gartlehner, Yaacoub, Boutron, Schwingshackl, Stadelmaier, Sommer, Alebouyeh, Afach, Meerpohl and Ravaud7

2 Objectives

To investigate this matter, we compared the performance of two strategies to identify RCTs in a sample of studies extracted from electronic databases: (1) the Cochrane highly sensitive search (i.e., a traditional search strategy using keywords and Boolean logic) and (2) the assessment by an LLM (GPT from OpenAI) of all articles in the sample. The reference standard was the double assessment by human reviewers.

3 Methods

On January 25, 2025, we retrieved a sample of studies indexed in PubMed (without limiting to MEDLINE records) with a publication date between September 1 and September 30, 2024. As there were 179,000 records indexed in PubMed in the time frame of the study, we used the keyword “Diabetes” to reduce the number of publications included in this study. Search strategies are provided in Supplementary Section 1.

Our objective was to identify primary reports of RCTs, excluding secondary and post-hoc analyses, protocols, and systematic reviews and meta-analyses. We considered as an RCT any prospective study designed to evaluate the causal effect of one or more interventions by randomly allocating eligible participants into two or more groups and comparing outcomes between these groups. This definition excluded, for instance, animal studies in which a disease was randomly induced to observe biomarker differences, as disease induction was not considered an intervention.

We performed the Cochrane highly sensitive search (2008 version) using the equation described in the Cochrane Handbook (Index test 1).Reference Lefebvre, Glanville, Briscoe, Featherstone, Littlewood, Metzendorf, Noel-Storr, Paynter, Rader, Thomas and Wieland6 We performed the GPT assessment of all titles and abstracts directly extracted from the database using a zero-shot prompt (i.e., the prompt included no example), inspired by a previous study, and ran on GPT-4o-mini through the application programming interface (API) of OpenAI (Index test 2) (Table 1).Reference Tran, Gartlehner, Yaacoub, Boutron, Schwingshackl, Stadelmaier, Sommer, Alebouyeh, Afach, Meerpohl and Ravaud7 The prompt was initially tested with GPT-4o to ensure high accuracy and then transitioned to GPT-4o-mini to reduce computational costs, as recommended by OpenAIReference Kwatra8 (Supplementary Section 2). The reference standard was the manual screening of the abstracts of retrieved articles by two independent reviewers (VTT and CG), blinded from the results of the index tests. Abstracts that did not explicitly describe the study design, but whose objectives were compatible with an RCT, were retained as potential primary reports of RCTs. Conflicts between reviewers were resolved through discussion and consensus during regular meetings.

Table 1 Prompt and example of output from the GPT model

Note: The prompt leverages the dialogue capability of GPT models to induce reflection on the eligibility of studies and improve performance. The example abstract is that of a systematic review; as it contains keywords specific to RCTs, it is not filtered by the Cochrane highly sensitive search.Reference Abhishek, Ogunkoya, Gugnani, Kaur, Muskawad, Singh, Singh, Soni, Julka and Udoyen10

We assessed the performance of the two index tests by calculating the sensitivity and specificity with 95% confidence intervals (CI).

4 Results

The electronic search retrieved 6377 records, of which 210 (3.5%) were primary reports of RCTs according to the reference standard.

The Cochrane highly sensitive search filtered 2197 records (34.4%), which included 209 primary reports of RCTs, 1988 false positives, and one false negative trial which used the abbreviation “RCT” rather than words such as “random*” in its title and abstract (sensitivity 99.5%, 95% CI 97.4% to100%; specificity 67.8%, 95% CI 66.6% to 68.9%).Reference Shinada, Kokubun, Takano, Iki, Kobayashi, Hamasaki and Taki9

Assessment of all titles and abstracts directly extracted from the electronic database by GPT filtered 1080 records (i.e., a reduction of 50.8% of the number of records from the Cochrane highly sensitive search). GPT included all 210 primary reports of RCTs from the reference standard and 870 false positives (sensitivity 100%, 95% CI 98.3% to100%; specificity 85.9%, 95% CI 85.0% to 86.8%).

5 Discussion

In this proof-of-concept study, we showed that LLMs could directly screen all articles in PubMed to identify RCTs, as an alternative to filtering databases with the Cochrane highly sensitive search. Use of LLMs reduced the number of articles to manually screen by 50%; enabled the identification of one reference missed by the Cochrane highly sensitive search; and improved the explainability of the search by providing reasons for each study excluded (Table 1).

In this study, we used the keyword “Diabetes” for feasibility reasons as there were 179,000 records indexed in PubMed in the time frame of the study. Of note, such a high number remains feasible with LLMs: the cost of analyzing an abstract with GPT-4o-mini was about $0.00051 (March 2025).

While our results suggest that LLMs can be used as a potential alternative to search strategies for identifying RCTs, this was a relatively simple task. Further research is needed to evaluate their performance in more complex tasks such as identifying studies that meet specific content-related inclusion criteria. Such an approach would be particularly relevant in emerging research domains where terminology is evolving and where a single concept may be described using multiple synonyms or context-dependent phrases (e.g., in a review on just-in-time interventions, we used a very broad search strategy due to the absence of standardized terminology across disciplines). However, as shown in this study, leaving traditional search strategies could lead to a very high number of articles to screen (e.g., there were 179,000 records indexed in PubMed over 1 month). A potential solution could be to use a hybrid approach with a search limited to specific keywords to delineate the topic (e.g., disease of interest), followed by LLM-based assessment of the retrieved articles. Nevertheless, these approaches should be evaluated before being implemented.

6 Conclusion

Unlike Boolean queries relying on exact keyword matches, LLMs could therefore identify relevant studies based on meaning and semantic content rather than on form and words used in the title and abstract.

Author contributions

V.-T.T.: conceptualization, data curation, formal analysis, methodology, project administration, writing – original draft. C.G.P.: data curation, formal analysis, writing—review and editing. I.B: conceptualization, resources, writing—review and editing. P.R.: conceptualization, methodology, resources, writing—review and editing.

Competing interest statement

The authors declare that no competing interests exist.

Data availability statement

The data that support the findings of this study are publicly available (abstract articles from electronic databases). A copy of the exact databases used in the study is provided in https://doi.org/10.5281/zenodo.16758565. The core functions required to process the data are appended to this manuscript.

Funding statement

The authors declare that no specific funding has been received for this article.

Supplementary material

To view supplementary material for this article, please visit http://doi.org/10.1017/rsm.2025.10034.

References

Allen, IE, Olkin, I. Estimating time to conduct a meta-analysis from number of citations retrieved. JAMA 1999;282(7):634635. https://doi.org/10.1001/jama.282.7.634.CrossRefGoogle ScholarPubMed
Schmidt, L, Cree, I, Campbell, F. Digital tools to support the systematic review process: an introduction. J Eval Clin Pract. 2025;31(3):e70100. https://doi.org/10.1111/jep.70100.CrossRefGoogle ScholarPubMed
Lieberum, JL, Töws, M, Metzendorf, MI, Heilmeyer, F, Siemens, W, Haverkamp, C, Böhringer, D, Meerpohl, JJ, Eisele-Metzger, A. Large language models for conducting systematic reviews: on the rise, but not yet ready for use: a scoping review. J Clin Epidemiol. 2025;181:111746. https://doi.org/10.1016/j.jclinepi.2025.111746.CrossRefGoogle Scholar
Cooper, C, Booth, A, Varley-Campbell, J, Britten, N, Garside, R. Defining the process to literature searching in systematic reviews: a literature review of guidance and supporting studies. BMC Med Res Methodol. 2018;18(1):85. https://doi.org/10.1186/s12874-018-0545-3.CrossRefGoogle ScholarPubMed
Robinson, KA, Dickersin, K. Development of a highly sensitive search strategy for the retrieval of reports of controlled trials using PubMed. International journal of epidemiology 2002;31(1):150153. https://doi.org/10.1093/ije/31.1.150.CrossRefGoogle ScholarPubMed
Lefebvre, C, Glanville, J, Briscoe, S, Featherstone, R, Littlewood, A, Metzendorf, M, Noel-Storr, A, Paynter, R, Rader, T, Thomas, J, Wieland, L. Chapter 4: Searching for and selecting studies. In: Higgins J, Thomas J, eds. The Cochrane Collaboration and John Wiley & Sons Ltd, Chichester (UK); 2024:67107.Google Scholar
Tran, VT, Gartlehner, G, Yaacoub, S, Boutron, I, Schwingshackl, L, Stadelmaier, J, Sommer, I, Alebouyeh, F, Afach, S, Meerpohl, J, Ravaud, P. Sensitivity and specificity of using GPT-3.5 turbo models for title and abstract screening in systematic reviews and meta-analyses. Ann Intern. Med. 2024;177(6):791799. https://doi.org/10.7326/m23-3389.CrossRefGoogle ScholarPubMed
Kwatra, S. Practical guide for model selection for real-world use cases. 2025. Accessed July 7, 2025. https://cookbook.openai.com/examples/partners/model_selection_guide/model_selection_guide?utm_source=chatgpt.com.Google Scholar
Shinada, T, Kokubun, K, Takano, Y, Iki, H, Kobayashi, K, Hamasaki, T, Taki, Y. Effects of natural reduced water on cognitive functions in older adults: a RCT study. Heliyon 2024;10(19):e38505. https://doi.org/10.1016/j.heliyon.2024.e38505.CrossRefGoogle ScholarPubMed
Abhishek, F, Ogunkoya, GD, Gugnani, JS, Kaur, H, Muskawad, S, Singh, M, Singh, G, Soni, U, Julka, D, Udoyen, AO. Comparative analysis of bariatric surgery and non-surgical therapies: impact on obesity-related comorbidities. Cureus 2024;16(9):e69653. https://doi.org/10.7759/cureus.69653.Google ScholarPubMed
Figure 0

Table 1 Prompt and example of output from the GPT model

Supplementary material: File

Tran et al. supplementary material

Tran et al. supplementary material
Download Tran et al. supplementary material(File)
File 16.9 KB