Highlights
What is already known?
A critical step in systematic reviews involves the definition of a search strategy, with keywords and Boolean logic, to filter electronic databases.
What is new?
We hypothesize that LLMs could directly screen all articles in electronic databases as an alternative to rigid search equations. To investigate this matter, we compared two methods to identify RCTs in electronic databases: the Cochrane highly sensitive search and an assessment by a GPT model (OpenAI) of all published articles. The Cochrane highly sensitive search missed one article and had poor specificity. In contrast, assessment of all titles and abstracts extracted from the electronic database by GPT yielded perfect sensitivity and high specificity.
Potential impact for RSM readers
In contrast to current uses of LLMs to accelerate the search process in systematic reviews by mimicking and improving human tasks, we show that it is possible to envision the systematic review process differently, taking advantage of the ability of LLMs to process large volumes of data automatically and at low cost.
1 Background
Synthesizing evidence from randomized controlled trials (RCTs) in systematic reviews and meta-analyses is a cornerstone of evidence-based medicine. Yet, current methods for systematic reviews are slow and resource intensive, requiring up to a year for a team when following recommended approaches [1, 2]. The exponential growth of medical literature and the demand for timely evidence synthesis have thus driven the development of automated tools to accelerate the review process, from literature screening to data extraction and risk of bias assessment [3].
One of the first steps of the review process involves the formulation of a search strategy with specific keywords and Boolean logic to filter electronic databases such as PubMed. Guidance documents for systematic reviews recommend combining broad sets of search terms for each concept (e.g., population, intervention), with the “OR” operator within concepts and the “AND” operator between concepts [4]. To identify studies with the appropriate design, the Cochrane Collaboration recommends adding the validated “highly sensitive search strategy” to identify RCTs in systematic reviews focused on interventional studies [5, 6]. While such an approach maximizes sensitivity, it often results in the retrieval of a large number of irrelevant records, which must then be screened manually.
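For illustration, the sensitivity-maximizing 2008 PubMed translation of this RCT filter reads approximately as follows (the authoritative version is given in the Cochrane Handbook [6], and the exact strings used in this study are provided in Supplementary Section 1):

(randomized controlled trial[pt] OR controlled clinical trial[pt] OR randomized[tiab] OR placebo[tiab] OR drug therapy[sh] OR randomly[tiab] OR trial[tiab] OR groups[tiab]) NOT (animals[mh] NOT humans[mh])

Because the filter matches on words rather than meaning, any record whose abstract contains generic terms such as “trial” or “groups” is retained.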
A recent scoping review of large language model (LLM) applications for evidence synthesis found that 41% of them targeted the search strategy by assisting in refining Medical Subject Headings (MeSH) terms, formulating PubMed queries, or translating queries across databases [3]. In contrast to these approaches, we hypothesize that LLMs’ ability to process large volumes of data automatically could be used to bypass the search strategy step entirely, by directly screening all articles in electronic databases as an alternative to rigid search equations [7].
2 Objectives
To investigate this matter, we compared the performance of two strategies to identify RCTs in a sample of studies extracted from electronic databases: (1) the Cochrane highly sensitive search (i.e., a traditional search strategy using keywords and Boolean logic) and (2) the assessment by an LLM (GPT from OpenAI) of all articles in the sample. The reference standard was the double assessment by human reviewers.
3 Methods
On January 25, 2025, we retrieved a sample of studies indexed in PubMed (without limiting to MEDLINE records) with a publication date between September 1 and September 30, 2024. As 179,000 records were indexed in PubMed over this period, we used the keyword “Diabetes” to reduce the number of publications included. Search strategies are provided in Supplementary Section 1.
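As a minimal sketch (assuming Python and the public NCBI E-utilities API; these are not the scripts used in the study, whose core functions are appended to this manuscript), such a date-limited retrieval could be reproduced as follows:

import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(term, mindate, maxdate, retmax=10000):
    """Return PMIDs of PubMed records matching `term` within a publication-date range."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",  # filter on publication date
        "mindate": mindate,
        "maxdate": maxdate,
        "retmax": retmax,
        "retmode": "json",
    }
    response = requests.get(f"{EUTILS}/esearch.fcgi", params=params, timeout=30)
    response.raise_for_status()
    return response.json()["esearchresult"]["idlist"]

# Sample described above: records mentioning "Diabetes" published in September 2024
pmids = search_pubmed("Diabetes", mindate="2024/09/01", maxdate="2024/09/30")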
Our objective was to identify primary reports of RCTs, excluding secondary and post-hoc analyses, protocols, and systematic reviews and meta-analyses. We considered as an RCT any prospective study designed to evaluate the causal effect of one or more interventions by randomly allocating eligible participants into two or more groups and comparing outcomes between these groups. This definition excluded, for instance, animal studies in which a disease was randomly induced to observe biomarker differences, as disease induction was not considered an intervention.
We performed the Cochrane highly sensitive search (2008 version) using the equation described in the Cochrane Handbook (Index test 1) [6]. We performed the GPT assessment of all titles and abstracts directly extracted from the database using a zero-shot prompt (i.e., a prompt including no example), inspired by a previous study and run on GPT-4o-mini through the application programming interface (API) of OpenAI (Index test 2) (Table 1) [7]. The prompt was initially tested with GPT-4o to ensure high accuracy and then transitioned to GPT-4o-mini to reduce computational costs, as recommended by OpenAI [8] (Supplementary Section 2). The reference standard was the manual screening of the abstracts of retrieved articles by two independent reviewers (VTT and CG), blinded to the results of the index tests. Abstracts that did not explicitly describe the study design, but whose objectives were compatible with an RCT, were retained as potential primary reports of RCTs. Conflicts between reviewers were resolved through discussion and consensus during regular meetings.
Table 1 Prompt and example of output from the GPT model

Note: The prompt leverages the dialogue capability of GPT models to induce reflection on the eligibility of studies and improve performance. The example abstract is that of a systematic review; as it contains keywords specific to RCTs, it is not excluded by the Cochrane highly sensitive search [10].
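A minimal sketch of the screening step, assuming the OpenAI Python SDK; the instruction below is a simplified placeholder, not the actual zero-shot prompt, which is reproduced in Table 1 and Supplementary Section 2:

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Simplified placeholder instruction; the actual zero-shot prompt is in Table 1.
SYSTEM_PROMPT = (
    "You are screening records for a systematic review. Decide whether the "
    "following title and abstract report the primary results of a randomized "
    "controlled trial. Answer INCLUDE or EXCLUDE, then give a one-sentence reason."
)

def screen_abstract(title, abstract, model="gpt-4o-mini"):
    """Classify a single title/abstract pair with the GPT model."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # favor reproducible outputs
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Title: {title}\n\nAbstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content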
We assessed the performance of the two index tests by calculating the sensitivity and specificity with 95% confidence intervals (CI).
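The interval method is not specified here; the values reported in the Results appear consistent with exact (Clopper-Pearson) intervals, which can be computed as in this minimal sketch (assuming Python with statsmodels):

from statsmodels.stats.proportion import proportion_confint

def sensitivity_specificity(tp, fn, tn, fp):
    """Point estimates and exact (Clopper-Pearson) 95% CIs."""
    sens, sens_ci = tp / (tp + fn), proportion_confint(tp, tp + fn, method="beta")
    spec, spec_ci = tn / (tn + fp), proportion_confint(tn, tn + fp, method="beta")
    return sens, sens_ci, spec, spec_ci

# Example with the Cochrane highly sensitive search results reported below:
# 209 true positives, 1 false negative, 1988 false positives,
# and 6377 - 210 - 1988 = 4179 true negatives.
print(sensitivity_specificity(tp=209, fn=1, tn=4179, fp=1988))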
4 Results
The electronic search retrieved 6377 records, of which 210 (3.5%) were primary reports of RCTs according to the reference standard.
The Cochrane highly sensitive search filtered 2197 records (34.4%), which included 209 primary reports of RCTs, 1988 false positives, and one false negative: a trial that used the abbreviation “RCT” rather than words such as “random*” in its title and abstract [9] (sensitivity 99.5%, 95% CI 97.4% to 100%; specificity 67.8%, 95% CI 66.6% to 68.9%).
Assessment of all titles and abstracts directly extracted from the electronic database by GPT filtered 1080 records (i.e., a reduction of 50.8% in the number of records relative to the Cochrane highly sensitive search). GPT included all 210 primary reports of RCTs from the reference standard and 870 false positives (sensitivity 100%, 95% CI 98.3% to 100%; specificity 85.9%, 95% CI 85.0% to 86.8%).
5 Discussion
In this proof-of-concept study, we showed that LLMs could directly screen all articles in PubMed to identify RCTs, as an alternative to filtering databases with the Cochrane highly sensitive search. Use of LLMs reduced the number of articles to screen manually by 50%; enabled the identification of one reference missed by the Cochrane highly sensitive search; and improved the explainability of the search by providing a reason for each excluded study (Table 1).
In this study, we used the keyword “Diabetes” for feasibility reasons, as there were 179,000 records indexed in PubMed in the time frame of the study. Of note, processing such a high number of records remains feasible with LLMs: the cost of analyzing one abstract with GPT-4o-mini was about $0.00051 (March 2025).
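At that rate, a back-of-the-envelope estimate for screening the full month of PubMed output follows directly:

# Back-of-the-envelope: screening one month of PubMed at the reported rate
records = 179_000            # records indexed in PubMed during the study month
cost_per_abstract = 0.00051  # USD per abstract with GPT-4o-mini (March 2025)
print(f"${records * cost_per_abstract:,.2f}")  # about $91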
While our results suggest that LLMs can be used as a potential alternative to search strategies for identifying RCTs, this was a relatively simple task. Further research is needed to evaluate their performance in more complex tasks such as identifying studies that meet specific content-related inclusion criteria. Such an approach would be particularly relevant in emerging research domains where terminology is evolving and where a single concept may be described using multiple synonyms or context-dependent phrases (e.g., in a review on just-in-time interventions, we used a very broad search strategy due to the absence of standardized terminology across disciplines). However, as shown in this study, moving away from traditional search strategies could lead to a very high number of articles to screen (e.g., there were 179,000 records indexed in PubMed over 1 month). A potential solution could be a hybrid approach, with a search limited to specific keywords to delineate the topic (e.g., the disease of interest), followed by LLM-based assessment of the retrieved articles, as sketched below. Nevertheless, these approaches should be evaluated before being implemented.
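As a sketch, such a hybrid approach could compose the two hypothetical helpers introduced in the Methods sketches above (search_pubmed and screen_abstract), with the keyword search delineating the topic and the LLM assessing eligibility:

# Hybrid approach: keyword search to delineate the topic, then LLM screening.
pmids = search_pubmed("Diabetes", mindate="2024/09/01", maxdate="2024/09/30")
decisions = {}
for pmid in pmids:
    title, abstract = fetch_title_abstract(pmid)  # hypothetical helper built on efetch
    decisions[pmid] = screen_abstract(title, abstract)
included = [pmid for pmid, verdict in decisions.items() if verdict.startswith("INCLUDE")]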
6 Conclusion
Unlike Boolean queries, which rely on exact keyword matches, LLMs can identify relevant studies based on meaning and semantic content rather than on the form and words used in the title and abstract.
Author contributions
V.-T.T.: conceptualization, data curation, formal analysis, methodology, project administration, writing (original draft). C.G.P.: data curation, formal analysis, writing (review and editing). I.B.: conceptualization, resources, writing (review and editing). P.R.: conceptualization, methodology, resources, writing (review and editing).
Competing interest statement
The authors declare that no competing interests exist.
Data availability statement
The data that support the findings of this study are publicly available (abstracts of articles from electronic databases). A copy of the exact databases used in the study is available at https://doi.org/10.5281/zenodo.16758565. The core functions required to process the data are appended to this manuscript.
Funding statement
The authors declare that no specific funding has been received for this article.
Supplementary material
To view supplementary material for this article, please visit http://doi.org/10.1017/rsm.2025.10034.