Highlights
What is already known?
Screening abstracts for systematic reviews is a labor-intensive process. Large language models, like the generative pre-trained transformer (GPT), have been suggested as tools to help categorize abstracts into study types—such as randomized controlled trials (RCTs) or animal studies—before screening, potentially reducing the number of references to review and making the process more efficient. However, there has been no direct comparison of how well GPT performs this task compared to more established models like bidirectional encoder representations from transformers (BERT).
What is new?
We introduce a new dataset of abstracts, labeled by humans according to 14 different study types. Using this dataset, we evaluated how well different GPT and BERT models classify these study types. Our results show that BERT models are particularly effective at this task and consistently outperformed the newer GPT models, making them a good choice for sorting abstracts before systematic review screening.
Potential impact for RSM readers
Using BERT models to pre-screen abstracts could reduce the workload of systematic reviews, especially for larger collections of references. Importantly, although GPT models have advanced capabilities and are trained on massive datasets, they were consistently outperformed by the more traditional BERT models for this type of classification.
Abbreviations

1 Introduction
Systematic reviews are fundamental to evidence-based medicine and research.Reference Cumpston, Li and Page1, Reference Marianna, Doneva, Howells, Leenaars and Ineichen2 Yet conducting these reviews is highly time-consuming, especially the screening of abstracts identified during a comprehensive literature search.Reference Bannach-Brown, Hair, Bahor, Soliman, Macleod and Liao3, Reference Marshall, Johnson, Wang, Rajasekaran and Wallace4 Typically, this screening requires substantial resources with two independent human reviewers and potentially a third one for resolution. As a result, the manual screening of a large number of abstracts can take months,Reference Borah, Brown, Capers and Kaiser5 a challenge exacerbated by the ever-increasing volume of scientific publications.Reference Ineichen, Rosso and Macleod6 One approach to mitigate the screening burden is to use medical subject headings (MeSH), which provide a standardized vocabulary that refines searches and can reduce irrelevant results. However, their coverage can be incomplete or delayed, and exploiting the full potential of MeSH terminology is an area of active research.Reference Lowe and Barnett7, Reference Wang, Scells, Koopman and Zuccon8
With the development of natural language processing (NLP) technologies, particularly large language models (LLMs), there is potential to reduce the manual effort involved in abstract screening. A promising approach is to use such models to trim the number of citations before title and abstract screening, e.g., by classifying articles by study type, such as RCTs, animal studies, systematic reviews, or narrative reviews. Traditional machine learning methods, such as support vector machines and convolutional neural networks, have been successfully used to classify RCTs,Reference Marshall, Noel-Storr, Kuiper, Thomas and Wallace9, Reference Cohen, Smalheiser and McDonagh10 often outperforming built-in search filters in databases. These models have also been used to develop Multi-Tagger, a system of publication type and study design taggers supporting biomedical indexing and evidence-based medicine. Multi-Tagger covers 50 publication types, including reviews, editorials, and cohort studies.Reference Cohen, Schneider and Fu11 The release of the transformer architecture in 2017 created a new wave of research into deep learning, leading to the appearance of the first LLMs.Reference Vaswani, Shazeer and Parmar12 This was closely followed by the release of the BERT models, which have since been actively applied to text classification in general and to the classification of study types, e.g., in relation to animal-alternative methods.Reference Neves, Klippert and Knöspel13 More recently, generative models, such as generative pre-trained transformer (GPT) and LLaMA, have also been applied to text classification.Reference Sun, Li and Li14, Reference Kumar, Sharma and Bedi15 Taken together, LLMs have been shown to reduce the screening burden of large abstract collections.Reference Van IJzendoorn, Habets, Vinkers and Otte16, Reference Abogunrin, Queiros, Bednarski, Sumner, Baehrens and Witzmann17
However, key gaps remain: First, current methods lack the detailed classification levels for study types needed for specific systematic reviews.Reference Van IJzendoorn, Habets, Vinkers and Otte16 For example, no existing model can automatically classify key study types, such as in vitro studies, animal studies, or studies assessing therapeutic interventions, which are important, for example, for animal or translational systematic reviews.Reference Van IJzendoorn, Habets, Vinkers and Otte16, Reference Van Luijk, Bakker, Rovers, Ritskes-Hoitinga, De Vries and Leenaars18, Reference Ineichen, Held, Salanti, Macleod and Wever19 Second, despite the growing popularity of generative models like GPT, there is a lack of empirical data on how these newer models perform in text classification against established transformer-based models, such as BERT. Third, a challenge in advancing these technologies is the need for high-quality, manually annotated data, which is essential for fine-tuning and evaluating models’ performance. Without available annotated datasets, the refinement and successful deployment of NLP models in specialized domains remain constrained.Reference Fries, Seelam and Altay20
To address these gaps, our study has three objectives: first, to develop a manually annotated dataset of abstracts for various study types found in PubMed; second, to use this corpus to train different transformer-based models from the BERT family to automatically classify PubMed abstracts by study type; and third, to compare the performance of these NLP methods with the newer generative LLMs, concretely GPT-3.5 and GPT-4.
2 Materials and methods
2.1 Approach and study design
We focused our study on neuroscience, one of the largest biomedical research fields. The data used for classification includes titles, abstracts, keywords, and journal names of research studies found on PubMed. We did not conduct a sample size calculation but relied on a convenience sample of slightly over 2,500 abstracts for our human-annotated corpus of 14 study types. This manually annotated corpus was then used to: 1) assess the performance of GPT-based models in automatically classifying these study types and 2) fine-tune various BERT-based models to compare their performance against GPT in classifying these study types. Details are described in the following paragraphs.
2.2 Data
2.2.1 Data collection
To obtain the initial set of relevant PubMed-IDs (PMIDs), we searched PubMed using the following search string: “Central nervous system diseases [MeSH] OR Mental Disorders Psychiatric illness [MeSH]” (search date: November 27, 2023). Out of 2,788,345 PMIDs, we randomly sampled 2,000 PMIDs for which we fetched the meta-data (title, abstract, author keywords, MeSH terms, journal name, PubMed date of publication, and date of journal publication). Although author keywords and MeSH terms may appear similar, they serve different purposes: Authors typically choose a small set of keywords based on the focus of their article, while indexers systematically assign standardized MeSH terms to place the work within a broader framework.Reference Névéol, Doğan and Lu21
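A minimal sketch of this retrieval step is shown below. The paper does not state which PubMed client was used, so the choice of Biopython’s Entrez module, the variable names, and the retmax shortcut are illustrative assumptions only.

```python
# Illustrative sketch only: the study does not specify the PubMed client used.
# Biopython's Entrez module, the retmax value, and the variable names are assumptions.
import random
from Bio import Entrez

Entrez.email = "your.name@example.org"  # placeholder; NCBI requires a contact address
QUERY = "Central nervous system diseases [MeSH] OR Mental Disorders Psychiatric illness [MeSH]"

# Retrieve matching PMIDs (NCBI caps retmax at 10,000; fetching the full ~2.8 million
# IDs would require paging or the E-utilities history server).
handle = Entrez.esearch(db="pubmed", term=QUERY, retmax=10000)
pmids = Entrez.read(handle)["IdList"]

# Randomly sample 2,000 PMIDs and fetch their metadata (title, abstract, keywords,
# MeSH terms, journal name, publication dates) as XML records.
sample = random.sample(pmids, 2000)
records = Entrez.read(Entrez.efetch(db="pubmed", id=",".join(sample), retmode="xml"))
```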
2.2.2 Data annotation
We developed detailed annotation guidelines for the different study types which were iteratively improved through three rounds of revision (see Section a of the Supplementary Material or https://osf.io/3yxqh/). The final guidelines are provided in Table S1 in the Supplementary Material. We use a classification system that categorizes study types according to their methodological design, study population, and intervention type. We defined “intervention” specifically in the context of therapeutic intervention, excluding diagnostic procedures or similar procedures. We pre-defined the following 15 study type classes, which are largely mutually exclusive and applied in the following hierarchy:
1. Study-protocol
2. Human-systematic-review
3. Non-systematic-review
4. Human-RCT-non-drug-intervention (RCT)
5. Human-RCT-drug-intervention
6. Human-RCT-non-intervention
7. Human-case-report
8. Human-non-RCT-non-drug-intervention
9. Human-non-RCT-drug-intervention
10. Animal-systematic-review
11. Animal-non-drug-intervention
12. Animal-drug-intervention
13. Animal-other
14. In-vitro-study
15. Remaining (a class for all remaining study types not belonging to any of the above-mentioned classes).
Notably, we made a post-hoc decision to exclude the class Animal-systematic-review as we did not identify any such studies in our corpus (see Section 2.2.3). This left us with 14 classes. We will refer to this number of classes in the following.
Two raters (SdV and BVI) independently assigned these study type classes to the set of 2000 references based on title, abstract, and journal name. Conflicts were resolved by discussion. To assess inter-rater agreement, we calculated Cohen’s Kappa statistics. Four references were excluded in accordance with the annotation guidelines (not being in the realm of neuroscience and/or not in English). All annotations and the conflict resolution were performed in the web-based annotation tool Prodigy.Reference Montani and Honnibal22
Next, to create labels for the binary classification task, we mapped each of the classes either to “Animal” or “Other.” This allowed us to evaluate the models’ abilities in distinguishing between animal and non-animal studies.
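A minimal sketch of this label mapping is given below. The exact rule is not spelled out in the text, so treating the three Animal-* classes as “Animal” and everything else as “Other” is an assumption consistent with the class list in Section 2.2.2.

```python
# Minimal sketch of the binary label mapping; the set-based rule is an
# illustrative assumption consistent with the class list in Section 2.2.2.
ANIMAL_CLASSES = {
    "Animal-non-drug-intervention",
    "Animal-drug-intervention",
    "Animal-other",
}

def to_binary_label(study_type: str) -> str:
    """Map a fine-grained study-type label to 'Animal' or 'Other'."""
    return "Animal" if study_type in ANIMAL_CLASSES else "Other"

print(to_binary_label("Animal-drug-intervention"))  # -> Animal
print(to_binary_label("Human-case-report"))         # -> Other
```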
2.2.3 Dataset enrichment
The manually annotated corpus showed a very uneven distribution of study types, with some classes having very few samples (e.g., Human-RCT-non-intervention) and the Remaining class containing three times as many samples as the second largest class. Fine-tuning BERT-based models on this unbalanced dataset poses a risk that the model will be biased towards over-predicting the larger classes and performing poorly on the underrepresented classes. See Figure 6a in the Supplementary Material and Section g of the Supplementary Material for the effect of enrichment on the performance of BERT-based models. At the same time, having very few examples of a class in our test set can lead to unreliable performance metrics, high variance in evaluation results, and poor representation of real-world performance. We thus made a post hoc decision to augment each of the minority classes with 50 additional abstracts (Table 1). These additional instances stemmed from previous systematic reviews conducted by our group.
Table 1 Annotated corpus: The table presents key statistics of the dataset after the stratified splitting into train, validation, and test sets

Note: The numeric values represent the number of records in the respective category
2.2.4 Data splits for model training and evaluation
We split the full dataset of 2,645 references into three datasets for training, validation, and testing with a 0.6-0.2-0.2 ratio, resulting in sub-corpora of 1,581, 530, and 534 samples, respectively (Table 1). The train and validation splits are used only for fine-tuning the BERT-based models, while the held-out test set is used for both BERT and GPT evaluation. A random sample from the validation dataset was used to select the most promising GPT prompts (see Section 2.3.2).
To ensure that all classes were represented in every part of the split, we employed a stratified splitting strategy. Each dataset record included the following fields: title, abstract, author keywords, and journal name. Note that we did not include MeSH terms, as empirical evaluations for a similar task have shown no positive impact on performance.Reference Neves, Klippert and Knöspel13
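The split can be reproduced with standard tooling; the sketch below uses scikit-learn’s stratified train_test_split and is illustrative only (the random seed and variable names are assumptions, not the study’s actual code).

```python
# Illustrative sketch of a stratified 60/20/20 split; seed and names are assumptions.
from sklearn.model_selection import train_test_split

def stratified_split(records, labels, seed=42):
    # Hold out 40% of the data, stratified by study-type label ...
    train_r, rest_r, train_y, rest_y = train_test_split(
        records, labels, test_size=0.4, stratify=labels, random_state=seed
    )
    # ... then split that 40% evenly into validation and test sets.
    val_r, test_r, val_y, test_y = train_test_split(
        rest_r, rest_y, test_size=0.5, stratify=rest_y, random_state=seed
    )
    return (train_r, train_y), (val_r, val_y), (test_r, test_y)
```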
2.3 GPT-based models
2.3.1 Setup
We employed two generative LLMs developed by OpenAI: GPT-3.5-turbo and GPT-4-turbo-preview.Reference Brown, Mann, Ryder, Larochelle, Ranzato, Hadsell, Balcan and Lin23, 24 These models have been trained on publicly available data (e.g., Common Crawl, a public dataset of web page data [https://commoncrawl.org] and Wikipedia) and licensed third-party data.Reference Brown, Mann, Ryder, Larochelle, Ranzato, Hadsell, Balcan and Lin23, 24 GPT models generate text by predicting the next word in a sequence, based on the context of the previous words. GPT-4 is an advancement over GPT-3.5, featuring a larger model size, enhanced performance (including in the domain of medicine),Reference Sandmann, Riepenhausen, Plagwitz and Varghese25, Reference Kipp26 and more extensive training data.
We utilized the OpenAI chat-completion API to obtain responses from both models.27 Each abstract’s classification was run in an independent session, and since the API currently does not have a memory function, there is no transfer of information or bias from one abstract to another. The classification task was structured as a question-answering problem, where GPT was prompted to identify the most suitable study type from a predefined study type set based on the paper’s title, abstract, journal name and, when available, keywords. The analysis was conducted with the temperature parameter set to 0 to minimize randomness in the responses (higher values produce more varied and creative outputs).
Furthermore, to deal with variation in GPT responses, we configured the output format of the API call to a specific structured output type (“json_object”). Additionally, we used a fuzzy matching approach to account for minor differences between model outputs and expected labels.
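The sketch below illustrates how such a call can be set up with the OpenAI Python SDK (chat completions, temperature 0, JSON output, fuzzy label matching). The prompt wording, the truncated label list, and the fallback label are placeholders, not the prompts actually used in the study (those are listed in Section j of the Supplementary Material).

```python
# Illustrative sketch (OpenAI Python SDK v1); prompt text, label list, and fallback
# are placeholders, not the study's actual prompts.
import json
from difflib import get_close_matches
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STUDY_TYPES = ["Study-protocol", "Human-systematic-review", "Non-systematic-review",
               "Human-case-report", "Animal-drug-intervention", "Remaining"]  # truncated

def classify_abstract(title, abstract, journal, keywords=None, model="gpt-4-turbo-preview"):
    user_msg = (
        "Classify the study type of this article.\n"
        f"Title: {title}\nJournal: {journal}\nKeywords: {keywords or 'n/a'}\n"
        f"Abstract: {abstract}\n"
        f'Answer as JSON: {{"study_type": <one of {STUDY_TYPES}>}}'
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,                            # minimize randomness
        response_format={"type": "json_object"},  # force structured JSON output
        messages=[{"role": "user", "content": user_msg}],
    )
    raw = json.loads(response.choices[0].message.content).get("study_type", "")
    # Fuzzy-match the raw output to the closest expected label.
    match = get_close_matches(raw, STUDY_TYPES, n=1, cutoff=0.6)
    return match[0] if match else "Remaining"
```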
2.3.2 Prompting strategy
We pre-defined three prompting strategies:
1. Zero-shotReference Nori, King, McKinney, Carignan and Horvitz28, Reference Brin, Sorin and Vaid29
The model is given the classification prompt without any additional context, examples, or specific instructions, relying solely on its pre-trained knowledge.
2. Contextual clues (CC)30
The model is provided with our detailed annotation guidelines (see Section a of the Supplementary Material and Table S1 in the Supplementary Material for the full version and Section b of the Supplementary Material and Table S2 in the Supplementary Material for the shortened version). These guidelines outline the classification criteria and labeling conventions, enabling the model to use this information to create an output.
3. Chain-of-thought (CoT)30, Reference Nori, Lee and Zhang31
The model is instructed to generate a chain of thought (step-by-step reasoning) before providing its final classification. This process encourages explicit reasoning and can improve accuracy by making the decision process more transparent.
For prompt engineering, i.e., designing the input instructions given to the model, we followed OpenAI’s Prompt Engineering Guidelines30 and other published recommendations.Reference Meskó32 Prompts are summarized in Table 2 and detailed in Section j of the Supplementary Material.
Table 2 Overview of employed prompting strategies

Note: Detailed prompts are listed in Section j of the Supplementary Material.
The “original annotation guidelines” refer to the complete guidelines (see Section a of the Supplementary Material and Table S1 in the Supplementary Material).
For the annotation guidelines, see Sections a and b of the Supplementary Material and Tables S1 and S2 in the Supplementary Material.
Abbreviations: CC, contextual clues; CoT, chain-of-thought.
We tested both 1) classification into one of 14 categories (multi-class) and 2) classification into just two (“Animal” versus “Other,” binary), for the following reasons: First, it allowed us to compare the performance of BERT-based and GPT-based models on a simpler binary classification task. Second, it allowed us to construct a hierarchical classification pipeline, i.e., a two-step approach where binary classification is followed by more detailed classification (i.e., a form of the chain-of-thought prompting strategy). We also briefly explored in-context learning (few-shot)Reference Nori, Lee and Zhang31 as a prompting strategy, i.e., giving the model a few examples in the prompt (Table S10 in the Supplementary Material). The selection of few-shot examples followed published approaches,Reference Jimenez Gutierrez, McNeal and Washington33, Reference Liu, Shen, Zhang, Dolan, Carin and Chen34 as follows:
1. Random Few-Shots: In this approach, the classification examples (few-shots) were randomly selected from the training split.
2. K-Nearest Neighbor (K-NN): Here, the few-shot examples were chosen based on the closest neighbors to each test example from the training split (an illustrative sketch follows below).
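The sketch below illustrates both selection strategies. The study does not report which text representation was used for the K-NN variant, so the TF-IDF vectors, the value of k, and the function names are assumptions for illustration.

```python
# Illustrative sketch of the two few-shot selection strategies; the TF-IDF
# representation and k=5 are assumptions, not the study's actual implementation.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def random_few_shots(train_texts, train_labels, k=5, seed=42):
    # Strategy 1: draw k labeled examples at random from the training split.
    idx = random.Random(seed).sample(range(len(train_texts)), k)
    return [(train_texts[i], train_labels[i]) for i in idx]

def knn_few_shots(train_texts, train_labels, test_text, k=5):
    # Strategy 2: pick the k training examples closest to the test abstract.
    vec = TfidfVectorizer().fit(train_texts)
    nn = NearestNeighbors(n_neighbors=k).fit(vec.transform(train_texts))
    _, idx = nn.kneighbors(vec.transform([test_text]))
    return [(train_texts[i], train_labels[i]) for i in idx[0]]
```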
2.3.3 Experiments
We first tested the prompting strategies on random samples of 50 abstracts from the validation dataset. To compare GPT with BERT performance, we re-ran all the prompts on the test set (enriched with keywords), see Table S8 in the Supplementary Material. In addition, we tested them on the test split without keywords (Table S9 in the Supplementary Material).
Due to relatively high API costs of GPT-4, we decided to only run the most promising prompts (based on their performance with GPT-3.5) and at least one prompt of each prompting strategy on GPT-4. In total, we tested eight selected prompts on GPT-4 (Table 3).
Table 3 Performance metrics for GPT-3.5-turbo and GPT-4-turbo-preview (for selected prompting strategy) as well as for BERT models (multi-class classification)

Note: Text in bold indicates the best performance
Performance of GPT models is measured on the enriched dataset with keywords.
Abbreviations: CC, contextual clues; CoT, chain-of-thought; “P2_H,” “P2_HIERARCHY,” i.e., hierarchical prompt 2; “b_P3,” binary prompt 3, upon whose outputs this hierarchical labeling was based (see Tables S6 and S7 in the Supplementary Material).
2.4 BERT-based models
2.4.1 Setup
BERT differs from GPT primarily in its training objective. While GPT is trained to predict the next token in a sequence using only the left (previous) context in a unidirectional manner, BERT uses a bidirectional approach through masked language modeling (MLM).Reference Devlin, Chang, Lee and Toutanova35 In MLM, random tokens are masked during training, and BERT learns to predict them using both the left and right context. This bidirectional conditioning enables BERT to capture richer contextual representations.Reference Devlin, Chang, Lee and Toutanova35, Reference Artetxe, Du, Goyal, Zettlemoyer, Stoyanov, Goldberg, Kozareva and Zhang36
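To illustrate the masked language modeling objective (this snippet is not part of our classification pipeline), the sketch below asks a pre-trained BERT model to fill in a masked token using both left and right context.

```python
# Illustration of BERT's masked-language-modeling objective only; not part of the
# study's classification pipeline.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill("This randomized controlled [MASK] evaluated a new drug."):
    print(prediction["token_str"], round(prediction["score"], 3))
```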
We fine-tuned seven BERT models (accessed via https://huggingface.co/):
• bert-base-uncased (BERT-base);Reference Devlin, Chang, Lee and Toutanova35
• scibert_scivocab_uncased (SciBERT);Reference Beltagy, Lo and Cohan37
• biobert-v1.1 (BioBERT);Reference Lee, Yoon and Kim38
• Bio_ClinicalBERT (ClinicalBERT);Reference Huang, Altosaar and Ranganath39
• BiomedNLP-BiomedBERT-base-uncased-abstract (BiomedBERT);Reference Chakraborty, Bisong, Bhatt, Wagner, Elliott, Mosconi, Scott, Bel and Zong40
• BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext (PubMedBERT);Reference Gu, Tinn and Cheng41
• BioLinkBERT-base (BioLinkBERT).Reference Yasunaga, Leskovec, Liang, Muresan, Nakov and Villavicencio42
Hyperparameter optimization is detailed in Table S4 in the Supplementary Material and Section e of the Supplementary Material. For more detailed information about the models, see Section d of the Supplementary Material. Model selection was guided by considerations of accessibility, domain specificity, performance, and size, with preference given to smaller (base) versions of the models. We also took into account the BLURB leaderboard of models in the domain of biomedicine.Reference Gu, Tinn and Cheng41
The input text consisted of the journal name, title, abstract, and—if available—keywords, combined into a single string. We did not apply further text preprocessing. Instead, we used the built-in tokenizers (AutoTokenizers) provided with each model to convert the text into tokens. Inputs were padded or truncated to a maximum length of 256 tokens (approximately 200–250 words), based on prior hyperparameter tuning. Default model settings and hyperparameters were used unless specified otherwise in Table S4 in the Supplementary Material and Section e of the Supplementary Material.
We fine-tuned on the train split and validated on the val (validation) split. The fine-tuned models were then evaluated on the previously unseen held-out test split.
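A minimal fine-tuning sketch following this setup is shown below, using the Hugging Face Trainer API. The hyperparameter values and the toy dataset are placeholders (the tuned values are in Table S4 of the Supplementary Material); the SciBERT checkpoint name is its public Hugging Face identifier.

```python
# Minimal fine-tuning sketch; hyperparameters and the toy dataset are placeholders,
# not the tuned values reported in Table S4 of the Supplementary Material.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "allenai/scibert_scivocab_uncased"
NUM_CLASSES = 14

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_CLASSES)

# 'text' holds the journal name, title, abstract, and (if available) keywords joined
# into one string; 'label' is the integer-encoded study type.
train_ds = Dataset.from_dict({"text": ["Journal X. Some title. Some abstract ..."], "label": [0]})
val_ds = Dataset.from_dict({"text": ["Journal Y. Another title. Another abstract ..."], "label": [0]})

def encode(batch):
    # Pad or truncate to 256 tokens, as in the setup described above.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train_ds = train_ds.map(encode, batched=True)
val_ds = val_ds.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=val_ds,
)
trainer.train()
```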
2.4.2 Experiments
For all experiments with BERT-based models, we used two classification types: binary and multi-class (see Section 2.3.2).
2.5 Comparative analysis with existing datasets
As a simple baseline for the binary classification task, we used the MeSH terms assigned to the studies. If they contained “animal,” the study was assigned this label, while the remaining studies were assigned the label “other.”
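A minimal sketch of this rule-based baseline is shown below; the case-insensitive substring match is an illustrative assumption of how the check can be implemented.

```python
# Sketch of the MeSH-based baseline; the case-insensitive substring match is an
# illustrative assumption.
def mesh_baseline(mesh_terms: list[str]) -> str:
    return "Animal" if any("animal" in term.lower() for term in mesh_terms) else "Other"

print(mesh_baseline(["Animals", "Rats", "Stroke"]))      # -> Animal
print(mesh_baseline(["Humans", "Multiple Sclerosis"]))   # -> Other
```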
To validate our dataset, we compared it with two recently published abstract classification corpora, Multi-Tagger and GoldHamster.Reference Cohen, Schneider and Fu11, Reference Neves, Klippert and Knöspel13, Reference Menke, Kilicoglu and Smalheiser43 The GoldHamster dataset supports classification of PubMed abstracts into experimental models, including “in vivo,” “organs,” “primary cells,” “immortal cell lines,” “invertebrates,” “humans,” “in silico,” and “other.” Due to incomplete documentation, we re-implemented their model locally and fine-tuned it on the provided dataset before applying it to the 2,645 abstracts in our corpus. A direct performance comparison with GoldHamster was only feasible for binary classification, where we mapped the “in_vivo” label to “Animal” and all others to “Other.” The additional categories were analyzed to gain further insight into our annotations.
Multi-Tagger focuses on human-related studies, providing classification for 50 labels on publication types (e.g., review and case report) and clinically relevant study designs (e.g., case-control study and retrospective study). We retrieved the predicted probability scores for all PubMed articles (up to 2024) from the publicly available dataset at Multi-Tagger File Download, along with threshold values reported to yield the highest F1-scores. Although these labels were not directly comparable to our own, they offered additional context and aided our understanding of the scope and consistency of our annotations.
2.6 Performance evaluation
To evaluate classification performance in our text classification task, we treat each label in a one-versus-rest manner, thus transforming the problem into a set of binary classification tasks. This allows us to assess how well the model identifies each individual class against all others.
For each class, we compute precision, recall, and F1 score, defined as:
• $\text{Precision} = \frac{\text{Number of correctly predicted instances of the class}}{\text{Total number of instances predicted as that class}}$, also referred to as positive predictive value.
• $\text{Recall} = \frac{\text{Number of correctly predicted instances of the class}}{\text{Total number of actual instances of that class}}$, also referred to as sensitivity.
• $\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$, i.e., the harmonic mean of precision and recall.
To assess the reliability of these metrics, we compute confidence intervals. Precision and recall intervals are calculated using binomial proportion methods,Reference Brown, Cai and DasGupta44 while F1 confidence intervals are estimated analytically following Takahashi et al.Reference Takahashi, Yamamoto, Kuchiba and Koyama45
For aggregated (multi-class) performance, we report weighted precision, recall, and F1 scores. These are computed as class-wise scores weighted by the number of true instances in each class:

$$\text{Metric}_{\text{weighted}} = \frac{\sum_{i=1}^{C} n_i \cdot \text{Metric}_i}{\sum_{i=1}^{C} n_i},$$

with $n_i$ being the number of true instances in class $i$, and $C$ the number of classes. This formulation accounts for class imbalance.
Confidence intervals for these aggregated metrics are computed using bootstrapping, where we resample the test set (with replacement) multiple times, compute the metrics for each sample, and derive empirical confidence intervals from the resulting distributions. All calculations are based on validated implementations from a publicly available Python library.Reference Gildenblat46 More mathematical detail is provided in Section c of the Supplementary Material.
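The sketch below re-implements the weighted metrics and the bootstrap confidence interval for the weighted F1-score using scikit-learn and NumPy; it is illustrative only, as the study relied on validated implementations from an existing Python library, and the number of resamples shown is an assumption.

```python
# Illustrative re-implementation of the weighted metrics and a bootstrap CI for the
# weighted F1; the study itself used a validated public library, and the number of
# resamples here is an assumption.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def weighted_scores(y_true, y_pred):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return precision, recall, f1

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=42):
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1_samples = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample test set with replacement
        f1_samples.append(weighted_scores(y_true[idx], y_pred[idx])[2])
    return tuple(np.quantile(f1_samples, [alpha / 2, 1 - alpha / 2]))
```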
3 Results
3.1 Corpus
The final corpus contained 2,645 samples (1,996 from the original annotation and 649 added to increase underrepresented classes) (Table 1). We encountered no error messages from the PubMed API and successfully retrieved all available data. All records included a title and abstract, while 6 lacked a journal name and 18 were missing MeSH terms. The largest proportion of missing data was for author keywords, which were absent in around 60% of records across all labels, except for Human-systematic-review and Study-protocol, which had about 30% missing (Figure 5 in the Supplementary Material). The inter-annotator-agreement (IAA) for the human annotation according to Cohen’s Kappa was 0.55 (95%-CI: 0.49–0.61, moderate agreement) for the first round and 0.71 (0.63–0.79, substantial agreement) for the second round. The most represented class was the Remaining class with a total of 858 samples, followed by Non-systematic-review with 371 samples. The smallest class was Human-RCT-non-intervention with 54 samples. In total, 379 samples were animal studies and 2,266 were non-animal studies. The mean abstract length was 237 words (range: 17–1,040) (Figure 1 and Table S3 in the Supplementary Material).

Figure 1 Abstract length per class, calculated on the whole dataset before splitting into train, validation, and test sets.
3.2 GPT models
GPT models performed almost perfectly in the binary text classification task (i.e., “Animal” versus “Other”) with F1-scores up to 0.99 (Table S6 in the Supplementary Material). However, this performance dropped substantially for multi-class classification (considering all 14 study classes), with F1-scores between 0.261 and 0.645. GPT-4 consistently outperformed GPT-3.5 (maximum F1-scores of 0.645 versus 0.540, respectively; Table 3 and Figure 2); note that only the most promising prompts were tested on GPT-4.

Figure 2 Per-class performance comparison between the best performing prompting strategy for GPT-3.5 and GPT-4 (P2_H_b3, CC), and SciBERT with 95%-confidence intervals.
For multi-class classification, the best performing prompting strategy was hierarchical prompting, i.e., to first classify into “Animal” studies versus “Other” (non-animal) studies, followed by further sub-classification into the 14 classes (with the highest performing prompt being “P2_H_b3,” i.e., “P2_HIERARCHY” based upon binary prompt P3; Table S7 in the Supplementary Material). The second-best performing prompt was multi-class-prompt P6 (adding the complete annotation guidelines to the GPT-prompt). The lowest performing strategy was zero-shot prompting, i.e., prompting the model without any additional context. All performance metrics for GPT-3.5 are reported in Tables S7 and S8 in the Supplementary Material.
GPT-4’s most inaccurately predicted classes were Human-RCT-non-intervention (0% correctly identified, commonly confused with Human-non-RCT-non-drug-intervention and Human-RCT-non-drug-intervention), Non-systematic-review (57%, commonly confused with Remaining), and Study-protocol (58%, commonly confused with Human-RCT-non-drug-intervention). In addition, the class Remaining was commonly confused with Human-non-RCT-non-drug-intervention (Figure 3). We did not observe an association between class sizes and model performance.

Figure 3 Confusion matrices of the best-performing (a) prompting strategy for the GPT-based model (GPT-4, P2_H_b3, CC) and (b) the BERT-based models (SciBERT).

Figure 4 Overview of top ten predicted labels based on the GoldHamster or Multi-Tagger corpus for (a, b) the full dataset, (c) the abstracts annotated as Remaining in our corpus, and (d) the abstracts annotated as Remaining in our corpus and as human by GoldHamster, highlighted in orange in (c). Subfigure (e) shows the top ten labels from Multi-Tagger containing Randomized Control Trials, abbreviated as RCT, and (f) the corresponding assigned labels to this set of articles in our dataset.
3.3 BERT models
The fine-tuned BERT models also performed almost perfectly in the binary text classification task (i.e., “Animal” versus “Other”) with F1-scores up to 0.99 (Table S6 in the Supplementary Material). In contrast to the GPT models, however, this performance remained high for multi-class classification (considering all 14 study classes), with F1-scores between 0.80 and 0.84 for different BERT models (Table 3), with SciBERT being the top-performing model.
SciBERT’s most inaccurately predicted classes were Human-RCT-non-intervention (45% correctly identified, commonly confused with Remaining and Human-RCT-non-drug-intervention) and Animal-non-drug-intervention (69%, commonly confused with the classes Animal-other or Animal-drug-intervention) (Figure 3). The size of the classes (ranging from 54 to 858) was not associated with model performance.
3.4 GPT versus BERT
Overall, SciBERT (the best performing BERT model) outperformed both GPT-3.5 and GPT-4 (Table 4 and Figure 2). Despite overlapping confidence intervals, there was a clear trend that SciBERT also outperformed GPT at class-level for most classes, except that GPT-4 outperformed SciBERT for Human-RCT-drug-intervention, Animal-non-drug-intervention, and Animal-drug-intervention (Figure 2 and Figure S2 in the Supplementary Material).
Table 4 Top performing models and strategies among the GPT and BERT models, evaluated in the multi-class classification task

Abbreviations: “P2_H_b3” is hierarchical prompt P2, “P2_HIERARCHY,” whose prediction was based upon the predictions of binary prompt P3.
3.5 Comparison to related work
The MeSH-based approach performed well in capturing nearly all animal articles (high recall). However, it mislabeled more articles as Animal than the ML-based models did, which resulted in lower precision. As a result, its overall F1-score was lower than that of the ML-based models (Table S5 in the Supplementary Material).
The GoldHamster model most frequently predicted the labels Other, Human, and In vivo. The model returned no label for around 300 abstracts for which it was uncertain about the correct classification (Figure 4a). For the abstracts in our Remaining category, the GoldHamster model likewise assigned the general category Other most often, followed by Human and not assigned (Figure 4c). Notably, the category In silico was also present. In the binary classification task, the model trained on the GoldHamster corpus performed comparably to SciBERT trained on our dataset (Table S5 in the Supplementary Material). Although its precision for Animal was slightly lower, its recall was higher, likely because multiple labels for some studies led to both more false positives and a broader capture of relevant studies.
When keeping only the Multi-Tagger labels whose scores exceeded the provided optimal thresholds, more than 1,000 studies remained without an assigned label (S1a). To mitigate this, whenever no label’s probability reached the Multi-Tagger thresholds, we assigned the label with the highest score. The top predicted labels over the whole dataset were Review, Clinical Study, and Case Report (Figure 4b). Case-Control Studies, Cross-Sectional Studies, and Retrospective Studies were the most common classifications for the articles we labeled as Remaining (S1b).
We were also interested in the case where the model trained on the GoldHamster dataset predicted Human study type, while our accepted label was Remaining. The most frequent labels predicted by Multi-Tagger for those 263 studies were Case-Control Studies, Clinical Study, and Cross-Sectional Studies (Figure 4d). Furthermore, articles assigned an RCT label by Multi-Tagger were also mostly identified as such in our dataset (Figure 4e and f).
4 Discussion
4.1 Main findings
This study aimed to use GPT-based LLMs for classification of scientific study types and to compare these models with smaller fine-tuned BERT-based language models. We find that GPT models, including the newer GPT-4, show only mediocre performance in classifying scientific study types, such as RCTs or animal studies. The greatest performance boost, although still moderate, was observed using contextual clues by providing the actual annotation guidelines to GPT. BERT models consistently outperformed GPT models by a substantial margin, reaching F1-scores above 0.8 for most study classes.
4.2 Findings in the context of existing evidence
We present a new human-annotated corpus of scientific abstracts tailored to various types of systematic reviews. This corpus includes abstracts relevant for clinical systematic reviews that often involve RCTs, e.g., of therapeutic interventions.Reference Lasserson, Thomas and Higgins47 Additionally, it covers study types pertinent to preclinical systematic reviews,Reference Sena, Currie, McCann, Macleod and Howells48 such as animal and in-vitro studies, with sub-classifications specifically for research into therapeutic interventions. This provides much greater granularity than existing corpora,Reference Van IJzendoorn, Habets, Vinkers and Otte16 making it a resource for, e.g., systematic reviews of preclinical studies.Reference Ineichen, Held, Salanti, Macleod and Wever19
Transformer-based models like BERT, i.e., a type of deep learning model architecture originally developed for text processing, have been used in the screening of scientific articles by automatically classifying study types,Reference Abogunrin, Queiros, Bednarski, Sumner, Baehrens and Witzmann17 reaching high sensitivity and specificity for differentiating a smaller number of different study types.Reference Van IJzendoorn, Habets, Vinkers and Otte16 BERT-models are pre-trained on a vast corpus of text and subsequently fine-tuned to perform specific tasks like text classification.Reference Devlin, Chang, Lee and Toutanova35 Consequently, the BERT family of models has become foundational when fine-tuning models for specific applications like classification, and for a target domain like medicine.Reference Gu, Tinn and Cheng41 For instance, BioBERT, pre-trained on extensive biomedical corpora, excels in biomedical text mining tasks and can be adapted into even more specialized models.Reference Lee, Yoon and Kim38 Our findings show that SciBERT may perform best in terms of the F1-score, though the small differences and overlapping confidence intervals compared to other BERT models preclude firm conclusions. Interestingly, it has been shown that combining different BERT models in an ensemble could enhance the performance of automated text classification even further, achieving sensitivities for text classification as high as 96%.Reference Qin, Liu and Wang49
As expected, GPT-4 showed better performance in study type classification than its predecessor, GPT-3.5, although its overall performance remained moderate. Notably, even advanced prompting strategiesReference Meskó32, Reference Sivarajkumar, Kelley, Samolyk-Mazzanti, Visweswaran and Wang50 that incorporated Contextual Clues (providing annotation guidelines) and Chain-of-Thought (explicit step-by-step reasoning) provided only marginal improvements in performance. For multi-class classification, the more traditional BERT models consistently outperformed the GPT models across all performance metrics in our sample. This is particularly striking given the exceptional abilities of GPT models in natural language understanding and generation across various applications. For instance, GPT models have shown remarkable results in the USMLE—a comprehensive three-step exam that evaluates clinical competency for medical licensure in the United StatesReference Nori, King, McKinney, Carignan and Horvitz28, Reference Brin, Sorin and Vaid29, Reference Taloni, Borselli and Scarsi51—and have even surpassed human performance in text annotation tasks.Reference Gilardi, Alizadeh and Kubli52 This serves as a reminder that, being generative models, their impressive language generation abilities may not necessarily predict good performance on classification tasks, particularly multi-class and domain-specific tasks. Notably, GPT-models have been used with some success to automatically perform title and abstract screening for systematic reviews,Reference Tran, Gartlehner and Yaacoub53– Reference Khraisha, Put, Kappenberg, Warraitch and Hadfield56 though it has not yet reached the level of fully replacing even one human reviewer, let alone two.Reference Tran, Gartlehner and Yaacoub53 We also considered fine-tuning open-weight LLMs. However, GPT models already offer a strong baseline and fine-tuning is costly and time-consuming. With limited academic resources, we chose GPT for its competitive results out of the box.
We showed that using MeSH terms to differentiate between animal and other study types resulted in more false positive animal-study classifications. This aligns with findings that some MeSH terms are not good discriminators for study types, due to their broader assignment.Reference Neves, Klippert and Knöspel13 We did not experiment with integrating MeSH terms into the text for fine-tuning/GPT prediction, as similar approaches were extensively evaluated in related work and showed no performance improvement.Reference Neves, Klippert and Knöspel13
A comparison with the labels predicted by the GoldHamster model showed high agreement in identifying animal studies. At the same time, the fine-grained annotations in our dataset and the GoldHamster corpus appear complementary. Notably, our work differentiates animal studies by intervention type (drug versus non-drug), whereas the GoldHamster corpus also covers non-animal study models, such as in-silico approaches. The GoldHamster model assigned a large proportion of studies to the non-specific Other class, suggesting additional relevant categories that are not captured by either dataset. Using the Multi-Tagger annotations revealed that these unlabeled studies mostly concern clinical studies in humans, including case-control or observational studies, which was not the focus in our dataset. For articles annotated as RCTs, Multi-Tagger can provide more detail for the study design, while our annotations help differentiate between RCTs of different intervention types. Applying the optimal Multi-Tagger thresholds would leave over 1,000 studies in our corpus unlabeled, given Multi-Tagger’s exclusive focus on human-relevant articles. Those observations suggest a promising future research direction to merge those related datasets to create a more comprehensive resource.
4.3 Limitations
Our study has a number of limitations: First, the natural distribution of study types in PubMed resulted in an imbalanced dataset, with, e.g., study protocols or non-interventional RCTs being underrepresented. The Remaining class, which includes all studies not fitting into other specified categories, was almost three times larger than the second largest class. Although we maintained this distribution to reflect real-world scenarios, we manually enriched each of the smaller classes with an additional 50 samples to partially mitigate the uneven class distribution and resolve the issue of severely underrepresented classes. Future research could assess how a more balanced class distribution could affect BERT model performance.
Second, although we employed various advanced prompting strategies, we did not explore role modeling with GPT that might enhance its effectiveness (e.g., “You are a systematic reviewer”). Future studies could also look into lightweight model adaptation methods like delta-tuning methods,Reference Ding, Qin and Yang57 such as BitFitReference Zaken, Ravfogel and Goldberg58 and adaptersReference Houlsby, Giurgiu and Jastrzebski59 to optimize performance.
Third, our use of GPT models was limited to GPT-3.5-turbo and GPT-4-turbo-preview, excluding other available API versions. Fourth, it was decided that model calibration (i.e., aligning the model’s confidence scores with actual accuracy) would exceed the technical scope of the study. We recognize, however, that the calibration between confidence and performanceReference Desai and Durrett60 would be essential for studies focusing on the creation of robust and optimized models.
Finally, limitations in the study scope and generalizability should be mentioned. While not a limitation of this study specifically, the applicability of automated study-type classification may be restricted to biomedical research. Fields, such as education or ecology, may lack clearly defined study types, limiting the usefulness of such approaches. Furthermore, our sampled articles were retrieved using a query focused on the neuroscience domain. While our aim was to develop study type definitions that are broadly useful for biomedical text mining, the resulting corpus is specific to neuroscience. However, we believe that the key features a classification model uses to identify study types are largely independent of the disease domain. Nonetheless, further experiments are required to assess the generalizability of our approach across other areas of biomedical research.
4.4 Strengths
Our study has two main strengths: First, we introduce a highly granular, dual-annotated abstract corpus designed for fine-tuning LLMs. This allows for more precise adaptations to specific tasks within text classification. Second, we conducted a comprehensive comparison of state-of-the-art GPT models, incorporating advanced prompting strategies, against a range of established BERT models. This approach enhances capabilities beyond those of PubMed indexing or MeSH terms.
4.5 Conclusions
Our study demonstrates that LLMs, particularly those based on BERT, can be used for the classification of study types for systematic reviews and are especially effective for this task. Specifically, trimming a reference library with such an approach prior to formal abstract screening could dramatically reduce human workload during the screening phase of systematic reviews. As the volume of scientific publications continues to grow, employing such tools will be critical in keeping abreast of scientific evidence.
Acknowledgements
We thank Yvain de Viragh and Naïm de Viragh for their help with conception and data analysis of our study.
Author contributions
Conception and design of study: S.E.D., S.d.V., B.V.I.; Acquisition of data: S.E.D., S.d.V.; Experimental setup and execution: S.E.D., S.d.V., H.H.; Analysis of data: S.E.D., S.d.V., H.H.; Interpretation of data: all authors; Drafting the initial manuscript: S.E.D., S.d.V., H.H., B.V.I. All authors critically revised the draft.
Competing interest statement
The authors declare that no competing interests exist.
Data availability statement
All data and code that support the findings of this study are available in our public GitHub repository (https://github.com/Ineichen-Group/StudyTypeTeller). For any questions regarding the data, metadata, or code analysis, contact the corresponding author, S.E.D.
Funding statement
This study was funded by the Swiss National Science Foundation (Grant No. 407940-206504, to B.V.I.), the UZH Digital Entrepreneur Fellowship (to B.V.I.), and the Universities Federation for Animal Welfare (to B.V.I.). The sponsors had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; or decision to submit the manuscript for publication.
Ethics statement
This study only used publicly available data sets, and no ethical approval was required.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/rsm.2025.10031.