
Aggregating and analysing clinical trials data from multiple public registers using R package ctrdata

Published online by Cambridge University Press:  04 December 2025

Ralf Herold*
Affiliation:
Regulatory Science and Innovation Taskforce, European Medicines Agency, Netherlands

Abstract

The ctrdata package has been created to boost the use of data available in public registers of clinical trials. It enables user-friendly, reproducible workflows to identify trials of interest, download protocol- and results-related data, and conduct sophisticated analyses, across multiple registers and trials. ctrdata works in the widely used R environment, and its databases can be used with other tools. The package is open source with a permissive licence, to facilitate collaboration.

This report provides an overview of ctrdata, including its implementation, cases of interest to researchers in public health, medicines, and regulatory science, as well as potential limitations and further developments. At this time, ctrdata works with the European Union (EU) Clinical Trials Information System (CTIS), the EU Clinical Trials Register (EUCTR), the US Clinicaltrials.Gov (CTGOV), and the ISRCTN—the UK’s Clinical Study Registry. The registers are complementary in scope and scientific value, yet differences in data models, variable definitions, search parametrisations, and retrieval options hamper efficient scientific workflows, calling for a scientific-technical, programmatic solution and driving the development of ctrdata.

By employing ctrdata to comprehensively use and easily leverage trial register data, researchers can effectively address a variety of questions, gain useful insights into evolving policies and practices of drug development, and inform further clinical research. Patients and their organisations, developers, policymakers, and other interested parties can build on ctrdata to create solutions for their use cases.

Information

Type
Software Focus
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Open Practices
Open materials
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Research Synthesis Methodology

Highlights

What is already known?

  • Clinical trials are a primary approach for generating evidence on health interventions. Trials are regulated to ensure participants’ well-being, scientific relevance, and transparency.

  • Registers make public rich trial data, but tools for efficiently using the data are lacking, impacting reproducible, deep information synthesis and learning.

What is new?

  • ctrdata, an open-source R package, enables using all publicly available data and documents from four registers (the EU Clinical Trials Information System [CTIS], the EU Clinical Trials Register [EUCTR], the US Clinicaltrials.Gov [CTGOV], and ISRCTN—the UK’s Clinical Study Registry). It supports research steps from identifying and storing trials of interest, deduplication, scrutinising structure, extracting fields, to analysing user- or pre-defined scientific and operational concepts.

Potential impact for RSM readers

  • With ctrdata, researchers can easily implement a programmatic workflow to investigate trials in depth. ctrdata keeps pace with register changes and user requirements. Its databases can be used with any system.

  • Beyond drug development, ctrdata is relevant for patient access, methodology research, health policy, and outcomes research.

1. Introduction

Clinical research should be well informed by evolving experience, for which public registers have become a transparent source and comprehensive reference. However, it is increasingly difficult to scrutinise trials and understand their design, conduct, and results, because their number and complexity are growing fast.Reference Getz, Stergiopoulos and Short 1

Registers are an important means by which sponsors and regulators increase the transparency on clinical trialsReference Egger, Herold, Rodriguez, Manent, Sweeney and Saint Raymond 2 for the benefit of patients, health professionals, researchers, and developers, whether from the academic or for-profit sector. Registers offer more or less user-friendly web interfaces, for manual finding and reviewing of specific trials of interest. Yet, surveys of the general public in European countries led Parsons et al. to conclude that ‘public interest in medicines R&D was greater than public knowledge, which suggests that attempts to increase public knowledge will be welcomed’.Reference Parsons, Starling and Mullan-Jensen 3 With such attempts, researchers, health professionals, and patients ‘can identify knowledge gaps that need to be filled with new trials’.Reference Califf, Cutler, Marston and Meeker-O’Connell 4

There is, however, a lack of tools that enable an efficient scientific-technical or programmatic approach to analyse register data from individual trials and from sets of trials. A report on the underutilisation of registers noted the extra effort and time required for manual screening of trials.Reference Alqaidoom, Nguyen, Awadh and Page 5 Linkage and synthesis of trial data across registers is hampered by differences between the registers’ data models, that is, by different variable structures and value lists used for corresponding information concepts. While the WHO International Clinical Trials Registry Platform (ICTRP) could be seen as an example of data linkage, it covers only a limited subset (24 items in the ‘WHO Trial Registration Data Set’), importantly without any results-related data. A screening of PubMed in July 2025 for meta-analyses or systematic reviews from the past ten years and the term ‘Clinicaltrials.Gov’, ‘ICTRP’, or ‘EudraCT’ yielded around 10500, 1800, and 60 results, respectively, with only some 35 results using both the US Clinicaltrials.Gov (CTGOV) and the EU Clinical Trials Register (EUCTR). 6 In the scientific literature, there seems to be a scarcity of reports of EUCTR and cross-register analyses. The complexity and continual evolution of register data are part of the technical challenges with data fragmentation across health data systems and with advancing data linkage between such systems.

The package ctrdata is a recent yet mature tool that facilitates accessing all public protocol- and result-related information on clinical trials in registers.Reference Herold 7 The functionality covers identifying, querying, downloading, aggregating, and analysing data across registers, including historical versions of trials and trial-related documents, as far as publicly available in registers. The package ctrdata is available as open source with a permissive licence, and collaborations are welcome to increase its usefulness as a tool. Package ctrdata works with the EUCTR, the EU Clinical Trials Information System (CTIS), the CTGOV, and ISRCTN—the UK’s Clinical Study Registry.

The intention with ctrdata is to maximise the usefulness of trial registers for increasing public knowledge, for participation in research, for informing on health interventions, for decision-making of patients and professionals, and for efficient future clinical research.

Creating package ctrdata was also motivated by questions in regulatory science that led to research activities within the European Medicines Agency (EMA), such as on the relation of juvenile animal studies and clinical trials in children 8 and on trends in clinical research during the COVID-19 pandemic.Reference Lasch, Psarelli and Herold 9

The leading idea is that ctrdata encompasses and abstracts register-specific parts of a workflow, on top of which users can build their generic trial workflow parts and applications.

This is the first report on ctrdata, and it covers the technical background, several use cases of likely interest to researchers in public health, medicines, and regulatory science, and a discussion of potential limitations and future developments.

2. Implementation

Package ctrdata provides a system for clinical trials data that includes loading from registers, storing and extracting for analysis and re-use.

Figure 1 gives an overview of the main components that ctrdata provides and uses.

Figure 1 Overview on using R package ctrdata. The arrow means ‘makes use of’. The user can execute functions in package ctrdata that query clinical trial registers (CTRs) and load data of trials of interest. Such functions include ‘ctrGenerateQueries()’, ‘ctrLoadQueryIntoDb()’, and ‘ctrFindActiveSubstanceSynonyms()’. These functions make use of the application programming interfaces (APIs) and web interfaces of four CTRs. The user can inspect trials with ctrdata function ‘ctrShowOneTrial()’ and select fields with ‘dbFindFields()’. Package ctrdata uses package nodbi for storing clinical trial data in SQLite, DuckDB, MongoDB, or PostgreSQL. Data of interest are generated using ctrdata functions ‘dbGetFieldsIntoDf()’ (which extracts data from the database, combines data from different registers into a data set, and calculates concepts across trials), ‘dfTrials2Long()’, and ‘dfName2Value()’ (which reshape a data set and select nested fields based on field identifiers).

The following sections discuss main ctrdata components.

2.1. Registers

When the development of ctrdata was started in 2015, making public information on registered clinical trials had been required by US legislation for fifteen years,Reference Tse, Fain and Zarin 10 by medical journal editors for ten years, 11 and by legislation in the European Union (EU) for five years.Reference Egger, Herold, Rodriguez, Manent, Sweeney and Saint Raymond 2

At that time, CTGOV provided an API, 12 and both CTGOV and EUCTR used XML schemas (Extensible Markup Language) 13 for data models, published them for information of data providers, and continually updated them to meet business requirements and changes in the legislative frameworks. 14 , 15

Package ctrdata has supported these two registers since 2015. The ISRCTN has been supported since 2021, when it began to be used for statutory purposes in the UK; it provides an API and XML data. 16

Since March 2023, ctrdata has supported the EU CTIS, both for data made public before and after its relaunch in mid-2024. 17 CTIS has been used for new clinical trials since February 2022. Data from CTIS are retrieved in JSON format (JavaScript Object Notation 18 ) from its public API that feeds the register’s web interface.

The current main characteristics of registers are summarised in Table 1.

Table 1 Overview of registers supported by ctrdata (numbers rounded to three significant digits; XML, extensible markup language; JSON, JavaScript Object Notation; ‘other types of clinical studies’ refers to studies of medical devices, behavioural and other health interventions, observational, non-interventional, and other studies)

2.2. Principles

Several principles evolved when developing package ctrdata.

All details are downloaded for clinical trials of interest, because only a complete public record is an accurate representation of the trial. Since the registers differ in scope and in their legal and regulatory purposes, their content is complementary, and ctrdata should therefore work with several registers and with an increasing number of them.

Before analyses, trial information should first be downloaded and stored in a database, because the set of trials of interest is often a union set from different search queries and possibly different registers, and because offline availability of data for analysis is useful.

The data models that are implicit in data as retrieved from the different registers are retained by ctrdata, because differences between the data structure and value sets of different registers can well be handled at the time of analysis, and because any mapping to a putative target data model would be a goal suitable for an international harmonisation organisation.

Against this technical background and these principles, ctrdata was implemented in the R environment, 19 which has a broad user base and extensive support for structured data, network operations, dependency management, and quality assurance.

At the same time, the users of ctrdata should be provided with functions that are simple and cover all relevant steps, without duplicating functions in R or one of its many extension packages. Main functions in ctrdata are listed in Table 2 in the order of a potential workflow, together with the number of the use case in Section 3 that exemplifies the function.

Table 2 Main functions in ctrdata in order of a potential workflow

Note: A full list of functions is part of the documentation website at https://rfhb.github.io/ctrdata/.

2.3. Analysis concepts

Package ctrdata includes functions that implement specific analysis concepts (Table 3). Concepts of clinical trials, such as their start date or their number of arms/groups with different investigational medicines, require analysing several fields against various criteria. However, the structure and the value sets of data fields differ between the registers.

Table 3 Overview on clinical trial concept functions implemented in ctrdata

Note: The refinement of some of the concepts is informed by ongoing research and use cases.

To address this situation, 20 trial concepts, pre-defined in ctrdata, are offered to simplify and accelerate a user’s analysis workflow, thereby increasing analysis consistency and reproducibility.

Some trial concepts can build on clear definitions and close similarities of registers; thus, concepts such as the trial phase, trial population, number of sites, and status of recruitment when loading the trial can be calculated with some confidence. Users should nevertheless note the respective help texts, which include any caveats, for example that EUCTR does not have numbers of sites for non-EEA (European Economic Area) countries.

Where definitions are not closely similar, an operational definition was chosen to create trial concepts of interest; an example is the sponsor type at the level of the trial, for which a new value of ‘mixed’ is calculated if a trial’s sponsors include commercial and non-commercial entities.

Other trial concepts reflect the author’s proposals for temporary approximations of less well-defined concepts, such as trial objectives (e.g. dose-finding, pharmacodynamics) and ‘f.likelyPlatformTrial()’, with the function name flagging the uncertainties.

The trial concepts in ctrdata (all described in Table 3) have not been validated with any formal approach but have been checked for plausibility and against common sense expectations. Where possible, the implementation of a trial concept is based on documented current understanding, on public data models, or on scientific papers, as relevant. Users are invited to note the help texts of the concepts, which mention the register fields and any caveats, to review the function logic in its code (as in Section 3.3), and to raise an issue or to contribute improving any trial concept in the public repository of ctrdata (see data availability statement).

2.4. Storage

For storage of trial data, a document-centric approach was chosen because all data on a particular trial represent a self-standing document, where documents can differ in structure and no schema needs to be pre-specified.

The R package nodbi is used as a connector to document-centric databases and was extended to work, in addition to MongoDB, with PostgreSQL, RSQLite, and DuckDB as backend.Reference Herold, Chamberlain, FitzJohn and Ooms 20 The latter are SQL databases but have functions for handling JSON which are abstracted by nodbi so that all four backends can be used interchangeably, without further changes in R scripts. Since RSQLite and DuckDB are available for all R platforms as local databases, their use with package ctrdata is likely of general interest.

Databases created with ctrdata can be accessed with other R packages and with other languages, such as Python, Julia, or JavaScript. Furthermore, using a MongoDB server enables analyses to be executed directly on the server, such as efficient aggregation pipelines as shown in one of ctrdata’s vignettes. 21

3. Use cases

The ten use cases in this section are diverse examples to illustrate how research questions can be addressed with ctrdata. A general workflow is shown in the sequence of functions in Table 2. The results of the use cases are not commented on or interpreted, since the intention is to exemplify just the functionality, without scientific review or discussion.

In R, package ctrdata is installed as follows:
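As a minimal sketch, assuming the released version on CRAN is wanted and that a CRAN mirror is reachable (the mirror URL below is one common choice, not a requirement of the package):

```r
# Install ctrdata from CRAN only if it is not yet available locally
if (!requireNamespace("ctrdata", quietly = TRUE)) {
  install.packages("ctrdata", repos = "https://cloud.r-project.org")
}
```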

Then, the package can be loaded, here together with the package ‘dplyr’ for pretty printing and scripting:
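A short sketch of this step, assuming both packages are already installed:

```r
# Attach ctrdata and, for printing and scripting convenience, dplyr
library(ctrdata)
library(dplyr)

# Report the installed version, which is useful for reproducibility
packageVersion("ctrdata")
```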

Information on CTRs is provided in package ctrdata, such as links to their documentation, reference pages, data structure, and value set descriptions:
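One way to reach this information is the package's help system. The sketch below assumes that the installed version of ctrdata documents the registers under the help topic name `ctrdata-registers`:

```r
# Look up the help page on the registers that ctrdata works with
# (the topic name is an assumption about the installed package version)
registers_help <- help("ctrdata-registers", package = "ctrdata")

# Printing the object displays the help page
print(registers_help)
```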

Importantly, users can open empty search and expert search pages as well as review the copyrights pages of registers for their acknowledgement before going further as follows:
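The following sketch assumes that function `ctrOpenSearchPagesInBrowser()` opens the registers' search pages when called without a query and shows the copyright pages when `copyright = TRUE`. The calls are guarded so that a web browser is only opened in an interactive session:

```r
library(ctrdata)

if (interactive()) {
  # Open the (empty) search pages of the registers in the web browser
  ctrOpenSearchPagesInBrowser()

  # Open the registers' copyright pages for review and acknowledgement
  ctrOpenSearchPagesInBrowser(copyright = TRUE)
}
```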

All registers (except CTIS) show in a web browser the URL that represents the user’s current trial search. This URL can be manually copied by the user and pasted as input for ctrdata to load the trial data, as exemplified below.

For convenience, a script is provided alongside package ctrdata that can be installed in the web browser, where it automatically copies register search URLs to the clipboard of the user’s device.Reference Herold 22 To this end, the user would first install the Tampermonkey browser extension and then import the script located at https://raw.githubusercontent.com/rfhb/ctrdata/master/tools/ctrdataURLcopier.js. The browser extension and the script can be disabled and enabled by the user at any time.

This script is particularly useful with CTIS, where it can modify the URL as shown in the web browser to reflect the user’s parameters for searching this register. Additionally, this script can show search results in CTIS when opening URLs such as https://euclinicaltrials.eu/ctis-public/search#searchCriteria={"status":[3,4]}; without the script, such query URLs have no effect in CTIS at this time.

3.1. Generate queries and count trials

Research often starts with developing a search strategy for information of interest. To facilitate searching in different CTRs, the user can provide high-level search parameters to function ‘ctrGenerateQueries()’. The parameters are translated into the different approaches of the trial registers for producing search results:
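A minimal sketch with illustrative search parameters, which are not necessarily those used for the results reported in this article. The parameter values shown (`"completed"`, `"phase 2"`, `"P"` for a paediatric population) are assumed to be accepted by the installed version of the function:

```r
library(ctrdata)

# Translate high-level search parameters into register-specific queries;
# this only composes URLs and does not contact the registers
queries <- ctrGenerateQueries(
  condition = "neuroblastoma",
  recruitment = "completed",
  phase = "phase 2",
  population = "P"
)

# A named character vector, one search URL per register and query type
print(queries)
```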

The function generates by default queries limited to interventional studies with medicines, referred to as ‘clinical trials’ throughout this article. The function parameter ‘onlyMedIntervTrials’ can be set to FALSE to remove this limitation and find all types of studies available in the register. Other interesting parameters of function ‘ctrGenerateQueries()’ are, for example, ‘countries’, ‘searchPhrase’, ‘condition’, ‘phase’, and various dates.

The function generates a (named) vector of hyperlinks specific to the registers. The links can be used to open the registers’ results pages so that the user can check and refine the queries:
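As a sketch, again with illustrative search parameters, each generated link can be handed to `ctrOpenSearchPagesInBrowser()`; the guard ensures a browser is only opened in an interactive session:

```r
library(ctrdata)

# Illustrative queries (assumed parameter values)
queries <- ctrGenerateQueries(
  condition = "neuroblastoma",
  recruitment = "completed",
  phase = "phase 2",
  population = "P"
)

# Open each register's results page so that the query can be checked
if (interactive()) {
  sapply(queries, ctrOpenSearchPagesInBrowser)
}
```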

A next step can be to determine the number of trials that can be obtained with the queries. To this end, function ‘ctrLoadQueryIntoDb()’ is executed, and this emits messages for the user’s information about the data exchange with the CTR, including the number of trials found.

As a more advanced programming pattern, function ‘ctrLoadQueryIntoDb()’ can be applied to all queries to store return values in a list, from which the number of trials can be extracted as follows:
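This pattern might look as follows. The sketch assumes network access to the registers, illustrative search parameters, and that the parameter `only.count = TRUE` makes the function report numbers without downloading trial data:

```r
library(ctrdata)

# Illustrative queries (assumed parameter values)
queries <- ctrGenerateQueries(
  condition = "neuroblastoma",
  recruitment = "completed",
  phase = "phase 2",
  population = "P"
)

# Apply the loading function to every query, counting only
result <- lapply(queries, ctrLoadQueryIntoDb, only.count = TRUE)

# Extract element "n" (number of records found) for every query
sapply(result, "[[", "n")
```

Because nothing is stored at this stage, no database connection needs to be specified.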

Note that the number of EUCTR records reflects the number of countries involved; it is thus a multiple of the number of trials. Also note that two types of queries are provided for CTGOV, one of which uses the register’s expert search page for interactively composing and executing more complicated and nested queries; both result in the same set of trials, and thus one query can be removed for subsequent use cases:
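A sketch of removing one query by name. It assumes that the expert search query is the element named `CTGOV2expert` in the vector returned by the installed version of ctrdata; if no such element exists, the vector remains unchanged:

```r
library(ctrdata)

# Illustrative queries (assumed parameter values)
queries <- ctrGenerateQueries(
  condition = "neuroblastoma",
  recruitment = "completed",
  phase = "phase 2",
  population = "P"
)

# Keep all queries except the CTGOV expert search (assumed element name)
queries <- queries[names(queries) != "CTGOV2expert"]
names(queries)
```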

Importantly, users retain full control over queries to match their specific research interest, for example by modifying the query strings as one would modify any other string in R. Function ‘ctrGenerateQueries()’ will remain useful to get started, since searches differ considerably between registers.

3.2. Download trial data for analyses

The trials that have been identified with a search strategy have to be retrieved and downloaded, in order to refine the set of trials of interest and to analyse any of their details. One of the principles recognised by ctrdata (see Section 2.2) is that a final set of trials of interest often results from complementary queries in the same or in different registers.

First, a connection to a database is created, here SQLite (DuckDB, MongoDB, and PostgreSQL can also be used), for which the corresponding R package needs to be installed. A collection (database table) is specified to hold data of trials of interest:
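A minimal sketch using package nodbi, which has to be installed together with RSQLite. The file location and the collection name are arbitrary choices for illustration:

```r
# Create (or re-open) a local SQLite database file and name the
# collection that will hold the trial records
db <- nodbi::src_sqlite(
  dbname = file.path(tempdir(), "ctrdata_trials.sqlite"),
  collection = "trials"
)

# Other backends are connected analogously, for example with
# nodbi::src_duckdb(), nodbi::src_mongo() or nodbi::src_postgres()
print(db)
```

A file in a permanent location would be used in practice so that the data remain available across R sessions.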

Second, the queries defined above are used to download the trial data. Function ‘ctrLoadQueryIntoDb()’ here is applied to all queries:
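The step could be sketched as follows, assuming network access to the registers and using illustrative search parameters rather than the queries behind the numbers reported in this article:

```r
library(ctrdata)

# Database connection (file location and collection name are arbitrary)
db <- nodbi::src_sqlite(
  dbname = file.path(tempdir(), "ctrdata_trials.sqlite"),
  collection = "trials"
)

# Illustrative queries (assumed parameter values)
queries <- ctrGenerateQueries(
  condition = "neuroblastoma",
  recruitment = "completed",
  phase = "phase 2",
  population = "P"
)

# Download the trials found with each query into the collection
result <- lapply(queries, ctrLoadQueryIntoDb, con = db)

# Number of records loaded per query
sapply(result, "[[", "n")
```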

For the total of almost 1700 trial records from four registers, the downloading takes around 100 s (at a maximum bandwidth of 10 MB/s; ctrdata throttles the number of requests per time period).

Any number of additional queries can be loaded into the same collection, for example:

Function ‘ctrLoadQueryIntoDb()’ can also repeat a query to load trials that were updated or are new since the last time the query was loaded. Since this is a main function of package ctrdata, its full signature provides an overview of the options with which a user can tailor the data to be loaded to their needs. Some of the options are discussed in the following use cases.

The function returns the number and identifiers of trials that were successfully loaded or failed to load, and the query that was used.

Furthermore, for documentation and reproducibility, ctrdata includes metadata in the database collection whenever ‘ctrLoadQueryIntoDb()’ is run, which users can check and re-use:
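A sketch using function `dbQueryHistory()`. To be runnable offline, it assumes that the installed version of ctrdata ships a small demonstration database (`demo.sqlite` with collection `my_trials`); with a user's own data, the connection created for loading trials would be used instead:

```r
library(ctrdata)

# Read-only connection to the demonstration database (assumed to be
# included in the installed package)
db <- nodbi::src_sqlite(
  dbname = system.file("extdata", "demo.sqlite", package = "ctrdata"),
  collection = "my_trials",
  flags = RSQLite::SQLITE_RO
)

# Data frame with one row per query that was loaded into the collection
history <- dbQueryHistory(con = db)
print(history)
```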

A second main function of ctrdata serves to provide a user-friendly table (data frame in R) from any data in a database collection, which can then be tabulated, for example:
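A minimal sketch with function `dbGetFieldsIntoDf()`, assuming the demonstration database that is expected to ship with ctrdata and the field `ctrname`, in which ctrdata records the register from which a trial was loaded:

```r
library(ctrdata)

# Read-only connection to the demonstration database (assumed to be
# included in the installed package)
db <- nodbi::src_sqlite(
  dbname = system.file("extdata", "demo.sqlite", package = "ctrdata"),
  collection = "my_trials",
  flags = RSQLite::SQLITE_RO
)

# Extract a field into a data frame; column "_id" identifies the record
result <- dbGetFieldsIntoDf(fields = "ctrname", con = db)

# Tabulate the number of records by register
table(result[["ctrname"]])
```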

Since trial data are extensive and hierarchically structured, a user can explore their structure and value sets of individual trials with ctrdata, which provides an interactive browser widget to identify individual data fields of interest:
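A sketch with function `ctrShowOneTrial()`, assuming the demonstration database expected to ship with ctrdata. The trial identifier is taken from the collection itself, and the widget is only opened in an interactive session:

```r
library(ctrdata)

# Read-only connection to the demonstration database (assumed to be
# included in the installed package)
db <- nodbi::src_sqlite(
  dbname = system.file("extdata", "demo.sqlite", package = "ctrdata"),
  collection = "my_trials",
  flags = RSQLite::SQLITE_RO
)

# Identifiers of the records in the collection
ids <- dbGetFieldsIntoDf(fields = "ctrname", con = db)[["_id"]]

# Show structure and values of the first trial in a browser widget
if (interactive()) {
  ctrShowOneTrial(identifier = ids[1], con = db)
}
```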

Fields of interest can also be found across a sample or all trials from different registers in a collection:
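A sketch with function `dbFindFields()`, assuming the demonstration database expected to ship with ctrdata and assuming that `sample = FALSE` makes the function inspect all records rather than a sample:

```r
library(ctrdata)

# Read-only connection to the demonstration database (assumed to be
# included in the installed package)
db <- nodbi::src_sqlite(
  dbname = system.file("extdata", "demo.sqlite", package = "ctrdata"),
  collection = "my_trials",
  flags = RSQLite::SQLITE_RO
)

# Find fields whose name includes "date" across all records
fields <- dbFindFields(namepart = "date", con = db, sample = FALSE)

# Field names, with the originating register(s) as element names
head(fields, 10)
```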

3.3. Analyse using pre-defined trial concepts

Beyond individual fields of interest as identified above, ctrdata comes with various trial concepts that are already implemented as functions for selecting and analysing fields from different registers.

It is seemingly simple to calculate the start date of a trial and its current recruitment status, yet it involves more than 20 fields across the registers to calculate the new columns that correspond to these two concepts for the data frame provided by ‘dbGetFieldsIntoDf()’:
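A sketch of requesting the two concepts through parameter `calculate`. It assumes the demonstration database expected to ship with ctrdata, and that calculated columns are named after the concept function with a leading dot (for example `.startDate`):

```r
library(ctrdata)

# Read-only connection to the demonstration database (assumed to be
# included in the installed package)
db <- nodbi::src_sqlite(
  dbname = system.file("extdata", "demo.sqlite", package = "ctrdata"),
  collection = "my_trials",
  flags = RSQLite::SQLITE_RO
)

# The fields needed by the concepts are retrieved automatically
result <- dbGetFieldsIntoDf(
  fields = "ctrname",
  calculate = c("f.startDate", "f.statusRecruitment"),
  con = db
)

# Two new columns hold the calculated concepts
str(result[, c("_id", ".startDate", ".statusRecruitment")])
```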

The pre-defined trial concepts greatly simplify a user’s workflow, and an overview of the currently 20 functions is available in Table 3 and here:
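One way to obtain such an overview is sketched below. It relies on the naming convention that concept functions start with `f.`, and assumes that the installed version documents them under the help topic `ctrdata-trial-concepts`:

```r
# List exported functions of ctrdata that follow the "f." convention
concepts <- sort(grep("^f[.]", getNamespaceExports("ctrdata"), value = TRUE))
print(concepts)

# Open the overview help page (the topic name is an assumption)
help("ctrdata-trial-concepts", package = "ctrdata")
```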

The function names indicate if a trial concept can be calculated exactly as above or can only be approximated (e.g. ‘f.likelyPlatformTrial()’). Users can inspect how a concept is calculated by calling the name of the function, for example:
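As a small sketch, printing the function object shows its R code, including the register fields it uses:

```r
# Typing the name without parentheses at the R prompt has the same
# effect as printing the function explicitly
print(ctrdata::f.likelyPlatformTrial)
```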

A particular trial may have been registered in more than one register, and in EUCTR one trial has one record for every participating EU Member State. Therefore, ctrdata provides the trial concept ‘f.isUniqueTrial()’, which helps to identify and select only unique trials before further analyses, to avoid double-counting:
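A sketch of such a selection, assuming the demonstration database expected to ship with ctrdata and that the concept adds a logical column named `.isUniqueTrial`:

```r
library(ctrdata)

# Read-only connection to the demonstration database (assumed to be
# included in the installed package)
db <- nodbi::src_sqlite(
  dbname = system.file("extdata", "demo.sqlite", package = "ctrdata"),
  collection = "my_trials",
  flags = RSQLite::SQLITE_RO
)

# Flag one record per trial across registers and Member States
result <- dbGetFieldsIntoDf(
  fields = "ctrname",
  calculate = "f.isUniqueTrial",
  con = db
)

# Keep only the records flagged as unique
unique_trials <- result[which(result[[".isUniqueTrial"]]), ]
c(records = nrow(result), unique = nrow(unique_trials))
```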

Alternatively to the above, at the time of loading trials from EUCTR, it is possible to just load a single record for any trial, by calling function ‘ctrLoadQueryIntoDb()’ with parameter ‘euctrprotocolsall’ set to FALSE. This setting can be useful when there are no questions about differences between Member States’ versions, such as dates, authorisation decisions, ethics opinions, and trial end.

In the example above, phases of medicine development are calculated based on values recorded in registers, and sample sizes are calculated to reflect the planned or the achieved number of participants, depending on the status of recruitment. After calculating the trial concepts, further analyses become reasonably simple, such as exploring associations (Figure 2):
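A sketch of such an exploration with base R graphics. It assumes the demonstration database expected to ship with ctrdata and the column names `.isUniqueTrial`, `.trialPhase`, and `.sampleSize` for the calculated concepts; it does not reproduce the data behind Figure 2:

```r
library(ctrdata)

# Read-only connection to the demonstration database (assumed to be
# included in the installed package)
db <- nodbi::src_sqlite(
  dbname = system.file("extdata", "demo.sqlite", package = "ctrdata"),
  collection = "my_trials",
  flags = RSQLite::SQLITE_RO
)

# Calculate the concepts needed for the plot
result <- dbGetFieldsIntoDf(
  fields = "ctrname",
  calculate = c("f.isUniqueTrial", "f.trialPhase", "f.sampleSize"),
  con = db
)

# Keep unique trials to avoid double-counting
unique_trials <- result[which(result[[".isUniqueTrial"]]), ]

# Distribution of sample size by trial phase
boxplot(
  .sampleSize ~ .trialPhase,
  data = unique_trials,
  xlab = "Phase of trial",
  ylab = "Sample size"
)
```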

Figure 2 Boxplot of sample size by phase of trial.

3.4. Merge and analyse information

Function ‘dfMergeVariablesRelevel()’ can be used for combining arbitrary fields of the same type from different registers into a new variable, here in an example for country data.

Clinical trials often span countries and even regions, in particular when conditions under study are rare, when a large number of participants is sought, or when the performance of interventions in the context of local health systems is to be analysed. 23 All trial registers supported by ctrdata provide data on countries, and CTIS and CTGOV provide data on individual sites, including their location and contacts. The respective fields are extracted into a data frame and then are merged into a new variable, concatenated with ‘ / ’:

The new variable can be used for example for a cross-tabulation against all countries involved in the trials of interest, or for counting the sites per country as follows:

Besides analysing country data, function ‘ctrGenerateQueries()’ can be used for searching trials that are conducted in countries specified by the user.

3.5. Analyse text data that describe endpoints

Endpoints or outcomes investigated in clinical studies are so far represented in registers as textual descriptions, and there is no controlled terminology established globally or in any register. A particular endpoint is usually described with components covering a title or short description, its operational definition, and the time points when it is evaluated, and together they roughly correspond to the variable, one of the four estimand components.Reference Pétavy, Guizzaro, dos, Teerenstra and Roes 24 The trial concept ‘f.primaryEndpointDescription()’ can provide the outcome variable in a single string; however, text analysis methods then need to be applied.

For example, questions may concern the extent to which difference-from-baseline variables continue to be used. Here, for unique efficacy trials with a phase 3 label, the text analysis method is using a regular expression to determine if the primary endpoint of a trial likely corresponds to such a variable or not, cross-tabulating with the type of sponsor:

Other research questions require an abstraction or categorisation of endpoint variables, such as questions about the type of endpoint.Reference Folgori, Bielicki and Ruiz 25 Here, package ctrdata can be used for obtaining and pre-processing endpoint data (shown above), which then could be fed into a suitable large language model to predict the sought category:

3.6. Results-related primary endpoint data

In previous sections, the retrieval from EUCTR did not include result-related data, because this can be quite time-consuming for this register. Results-related data are always retrieved from CTGOV. From CTIS and ISRCTN, there are no results available in a structured format for the foreseeable future.

Retrieving results could have been done already during the first loading of trial data; newly loading trial data that include results overwrites any records of the same trials that were previously loaded (while maintaining user annotations such as those used below):

Package ctrdata provides data on primary endpoints by analysing various fields in different registers. This simplifies a user’s workflow, for example to explore details of results of null hypothesis significance testing (NHST, Figure 3):

Figure 3 Cumulative density of reported p-values (dotted line, p = 0.05).

Similarly, statistical methods used for primary endpoint analysis can be tabulated:

3.7. Changes during trial conduct

With the availability of historic versions of registered trials, changes over time can be identified by comparing data of interest across versions. At this time, only CTGOV directly provides historic versions; in addition, ctrdata can create historic versions also for CTIS, when re-running a previous query. Since retrieving historic versions is time-consuming, it has to be specified by the user when calling function ‘ctrLoadQueryIntoDb()’.

For CTGOV, a user has to specify the parameter ‘ctgov2history’ to be either a number (which loads this number of historic versions, at equal time intervals from the first to the current version), a string such as ‘n:m’ (which loads the nth to the mth version) or TRUE (which loads all available versions).
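A sketch of loading historic versions, assuming network access to CTGOV, illustrative search parameters, and that the CTGOV query is the element named `CTGOV2` in the vector of generated queries:

```r
library(ctrdata)

# In-memory SQLite database, sufficient for a quick trial run
db <- nodbi::src_sqlite(collection = "history_demo")

# Illustrative queries (assumed parameter values)
queries <- ctrGenerateQueries(
  condition = "neuroblastoma",
  recruitment = "completed",
  phase = "phase 2",
  population = "P"
)

# Load trials with three historic versions each, spread from the first
# to the current version ("CTGOV2" is an assumed element name)
result <- ctrLoadQueryIntoDb(
  queryterm = queries[["CTGOV2"]],
  ctgov2history = 3L,
  con = db
)
result[["n"]]
```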

Changes in the targeted sample size are one example of the research questions that can be addressed with historic versions:

For CTIS, a user has to set parameter ‘ctishistory’ to TRUE and specify ‘querytoupdate’, which moves a current CTIS record in the database into an array of historic versions in the record, before updating the record from CTIS. Thus, historical versions depend on when a user updates a previous CTIS query; for example, changes in recruitment numbers across current and historic versions can be analysed as follows:

3.8. Linking trial and product data

The research activities in clinical trials can lead to data that show the quality, safety, and efficacy of medicines. Research questions about market and patient access to medicines include the progress from trials to product authorisation. They are examples of questions that require two or more data sources for analysis. Here, data on new molecular entities that are authorised medicinal products are retrieved from openFDA 26 :

Data on trials are retrieved for the new molecular entities one by one, storing the respective name in the user annotation of the retrieved trials:

Therapeutic-confirmatory trials, often referred to as phase 3 trials, are typically needed for regulatory review of applications for marketing authorisation, and here they are merged with product data for a visualisation of trial completion (‘C’) and product authorisation (‘A’, Figure 4):

Figure 4 Completion of phase 3 trials and authorisation of medicines.

3.9. Exploring documents of trials

Study documents offer a rich source of information that is made readily accessible by ctrdata for exploration. Loading documents is activated in function ‘ctrLoadQueryIntoDb()’ by specifying the name of a directory in parameter ‘documents.path’. As a first step for this example, the parameter ‘documents.regexp’ is set to NULL, which causes the function to create empty placeholder files for every document that could be loaded. The names of the files are analysed to obtain an overview of types of available documents.

In a second step, parameter ‘documents.regexp’ can be set to a regular expression that causes downloading the files conforming to the expression.
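The two steps could be sketched as follows, assuming network access to CTGOV, illustrative search parameters, an assumed element name `CTGOV2` for the CTGOV query, and an illustrative regular expression for the second step:

```r
library(ctrdata)

# In-memory SQLite database and a temporary directory for documents
db <- nodbi::src_sqlite(collection = "documents_demo")
docs_dir <- file.path(tempdir(), "ctrdata_documents")
dir.create(docs_dir, showWarnings = FALSE)

# Illustrative queries (assumed parameter values)
queries <- ctrGenerateQueries(
  condition = "neuroblastoma",
  recruitment = "completed",
  phase = "phase 2",
  population = "P"
)

# Step 1: create empty placeholder files for all available documents
result <- ctrLoadQueryIntoDb(
  queryterm = queries[["CTGOV2"]],
  documents.path = docs_dir,
  documents.regexp = NULL,
  con = db
)

# Overview of file names, from which document types can be read
head(list.files(docs_dir, recursive = TRUE))

# Step 2 (not run here): download only files whose names match a
# regular expression, for example protocols and analysis plans
# ctrLoadQueryIntoDb(
#   queryterm = queries[["CTGOV2"]],
#   documents.path = docs_dir,
#   documents.regexp = "prot|sap",
#   con = db
# )
```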

A broad set of tools is available for text analysis of document corpora. A recently published tool chain for a retrieval-augmented generation (RAG) workflow in R is ‘ragnar’.Reference Kalinowski and Falbel 27 This is used in the following example to search for pharmacokinetics in result documents.

3.10. Safety data analysis across registers

Use case 10 is provided in the Supplementary Material to this article.

4. Related tools

A small number of tools are related, to varying degrees, to the objectives and functionality of package ctrdata.

Tools implemented in R include package ‘rclinicaltrials’ [28]. It only supports CTGOV and has functions to download and transform trial data into R objects; it is not available on the Comprehensive R Archive Network (CRAN), and the latest commits were pushed to its public repository in 2017. The recent package ‘clintrialx’ only supports CTGOV (or its derivative AACT) [29]. Package ‘cthist’ focuses on historical data from CTGOV [30].

Other tools include the ‘clinicaltrials-act-tracker’, which is implemented in Python and uses CTGOV for analysing reporting compliance [31]. For a comprehensive metadata repository [32], tools for downloading and storing data from trial registers CTGOV and EUCTR are implemented in C# [33].

While package ctrdata covers the functionality of the above-mentioned tools, it supports additional and more diverse use cases, as exemplified in this article. Only ctrdata works with four registers, allows all registers to be used in the same workflow, and makes maximal use of the registers’ public data.

A curated set of R packages relevant for clinical trials, including ctrdata, is in the ‘CRAN Task View: Clinical Trial Design, Monitoring, and Analysis’ [34].

5. Limitations and mitigations

Limitations concern the implementation of package ctrdata, the functionality of ctrdata, and the use of registers for research questions.

5.1. Implementation

So far, package ctrdata has been developed by a single developer, and this situation could impact the quality of its implementation and coding, which in turn may impact code comprehensibility and opportunities for involving other developers. This article contributes to the visibility of ctrdata and to attracting contributors. In addition, the following steps mitigate this potential limitation.

For readability and maintainability, code in package ctrdata follows style conventions, uses standard linting, and is well documented with line and function comments as well as with a comprehensive website that includes vignettes and examples. Over the years, improvement exercises for already functioning code were repeatedly undertaken, including substantial re-implementation, refactoring, or factoring out, and this improved code quality, limited dependencies, stabilised performance, simplified functions, and helped with adding registers or adapting to their changes.

Unit and other tests are written at the time that code is written or issues are fixed, and now more than 630 tests cover more than 94% of the code base. A continuous integration pipeline automates testing on several operating systems and with different database backends.

Since 2015, code has been regularly committed and pushed to its public repository. Users from across the world have contributed around 50 issues so far, which are visible in the public repository and were typically resolved within hours or days. Since 2016, package ctrdata has been available on CRAN, which requires stringent checks.

5.2. Functionality

Limitations could be seen in the functionality of package ctrdata.

For example, ctrdata does not map or translate related data elements from registers to a single common data model for data storage. Another limitation could be seen in the choice of trial registers that are currently supported by ctrdata. The reasons for these two choices that could be perceived as limitations are discussed in Section 1.

Furthermore, even though no formal approach was used for managing business requirements, the current functionality of ctrdata was informed by a broad variety of questions that users sought to answer with trial register data and by wishes for specific functionalities, such as downloading trial-related documents or providing a correspondence matrix of trial identifiers.

5.3. Use for research questions

For research and scientific questions, package ctrdata may be limited if not all relevant registers can be used with ctrdata, or when content of trial registers is incorrect or incomplete. The latter situation can arise from different statutory obligations on trial sponsors. Also, information made publicly available may be ambiguous or not detailed enough for the questions at hand. Early phase trials or trials that were rejected by oversight authorities or ethics committees may have only scarce details or may not be publicly visible at all. Some statutory requirements and information content details are in Table 1.

A challenge with using ctrdata for research questions is that currently interesting topics are often not represented as structured data in trial register information. For example, at this time, platform trials or integrated research platforms [35] cannot be directly identified in searches or characterised in publicly available data from trial registers. Some registers provide examples for registering trials with such and other less common designs [36]. Package ctrdata offers a set of trial concepts (Table 3) to mitigate the potential limitations of incomplete controlled vocabularies and of evolving research concepts.

For using CTGOV for research, ten common problems were described [10]. Conceptually, many of the issues concern any trial register. Package ctrdata can help address or mitigate many issues, as exemplified in Table 4.

Table 4 Selected issues when using trial registers for research, and how package ctrdata can help

6. Discussion

This article presents package ctrdata as a new and unique tool that facilitates analysing protocols and results of clinical trials, in detail and for multiple trials from multiple registers at the same time, in an efficient and reproducible approach. The variety of use cases presented above underlines that ctrdata is a simple yet powerful tool. This article also invites collaboration to improve the tool, for which ctrdata is offered in a public repository.

The objective of this package is to support the scientific use of the vast data on clinical trials in public registers, with a view to better leveraging existing research and to accelerating and improving future research. The tool thus will be of interest for patients and their organisations, clinicians, clinical researchers, pharmaceutical companies, policymakers, health outcome researchers, and medicine regulators. Use cases that cut across interested parties could be envisioned as follows.

Dashboards can be built rapidly on top of package ctrdata using ready-made frameworks, in analogy to those provided during the COVID-19 pandemic for an integrated presentation of global data from different domains (e.g. epidemiology, clinical trials, and molecular biology). Users can adjust and interpret tabulations, aggregations, and visualisations of trial data, including regional availability for potential participation and results of interventions to inform patient-health professional discussions.

Notification systems can employ package ctrdata to query for changes of completed, ongoing, or newly started trials in different registers and link with other health data systems. Users could set themselves up to be automatically notified when news concerns medical conditions, types of medicines, trials in their area or otherwise of interest, or the authorisation and availability of medicines.

Building on ctrdata, patients, researchers, and regulators can better use trial information when writing patient guides, clinical guidelines or assessing a medicine dossier, reviewing methodological details to build experience, and identifying knowledge gaps for which new trials are needed. Scientific works will benefit from the high level of transparency and completeness combined with minimised biases (e.g. availability and spin) where stringent regulatory requirements for trials are applicable. Repeat analyses with reproducible methods can track the performance and change of the clinical research landscape, which will better inform initiatives, policies, and funding.

With respect to technical aspects of ctrdata, the R environment was chosen for reasons including its availability across operating system platforms, its robust infrastructure with package management and quality assurance, and its increasing base of users. To expand the options for exploring global clinical research with package ctrdata, additional registers are considered, but some interesting registers have policies or technical indications against programmatic access and thus will not be included.

For using register data to answer research questions, register curators have published recommendations on how to avoid common problems with trial data. Package ctrdata helps avoid some of these problems, in particular by fully documenting a reproducible search, download, selection, and analysis of trials, which should permit the validation of derived conclusions.

There are interesting challenges for analysing the clinical trial landscape using information from public registers. For example, it remains difficult to track complex clinical trials [37]; here, an approximating concept in package ctrdata may help develop an international approach. A further aim is to address difficulties in identifying and analysing specific trial features, such as those related to the estimands framework, adaptations of trials during their conduct, and decentralised elements, because these could be impactful for accelerating and improving future trials. Furthermore, data linkage will become a common use case, using trial register information with other public or private data sources, such as those exposed by the openFDA APIs [26] as shown above, publication databases, the EU Open Data Portal [38], or health outcome and real-world databases [39]. In particular, use cases could link trial register data with regulatory dossiers such as the EU Clinical data publication [40] or Health Canada’s Public Release of Clinical Information [41] and, perhaps most importantly, with regulatory reviews of the design, conduct, and results of trials in medicine dossiers as documented in assessment reports [42].

For the ongoing development of package ctrdata, consideration is given to such challenges as well as to EU initiatives such as for clinical trials [43] and for collaboration on regulatory science questions [44]. Questions concerning features of clinical trials are also among the recently updated regulatory science research needs [45].

In summary, trial registers accumulate data rapidly and continuously: the more the data are explored and used widely, the more they will become valuable information to accelerate clinical research and to improve health care. Package ctrdata enables great strides towards these goals.

Acknowledgements

I thank Franz König (University of Vienna, Austria) and Peter Arlett (European Medicines Agency) for encouraging me to prepare this article, data providers and register curators for their dedication to maintain data quality and transparency, and creators and maintainers of R and its rich set of packages.

Author contributions

R.H.: Conceptualisation, Data curation, Formal analysis, Methodology, Project administration, Resources, Software, Writing—original draft, Writing—review and editing.

Competing interest statement

The author of the article and creator of package ctrdata is a full-time employee of a decentralised agency of the European Union and declares the absence of competing interests.

Data availability statement

The software presented in this paper is available as open source with a permissive licence (MIT) at https://cran.r-project.org/package=ctrdata. This page includes links to the comprehensive documentation website of the package at https://rfhb.github.io/ctrdata/ and to its public repository at https://github.com/rfhb/ctrdata/, which can be used to ask questions, flag issues, and suggest improvements. The programming script for the use cases (Section 3) is provided as Supplementary Material to this publication.

Funding statement

No funding was requested or obtained for package ctrdata or any work described in this article.

Disclaimer

The views expressed in this article are the personal views of the author and may not be understood or quoted as being made on behalf of or reflecting the position of the European Medicines Agency or one of its committees or working parties.

The package ctrdata presented in this article resulted exclusively from an outside activity of the author authorised by the European Medicines Agency and may not be understood as a service or product of the European Medicines Agency.

Disclosure of use of AI tools

No AI tools were used in any part of the writing process of this article or in the programming of package ctrdata, and no AI tools are required for using ctrdata. However, two of the use cases explain how ctrdata could be used with AI tools that users may have and wish to use in their research.

Supplementary material

To view supplementary material for this article, please visit http://doi.org/10.1017/rsm.2025.10061.

Footnotes

This article was awarded the Open Materials badge for transparent practices. See the Data availability statement for details.

References

Getz, KA, Stergiopoulos, S, Short, M, et al. The impact of protocol amendments on clinical trial performance and cost. Ther Innov Regul Sci. 2016;50(4):436–441. https://doi.org/10.1177/2168479016632271.
Egger, GF, Herold, R, Rodriguez, A, Manent, N, Sweeney, F, Saint Raymond, A. European Union clinical trials register: on the way to more transparency of clinical trial data. Expert Rev Clin Pharmacol. 2013;6(5):457–459. https://doi.org/10.1586/17512433.2013.827404.
Parsons, S, Starling, B, Mullan-Jensen, C, et al. What the public knows and wants to know about medicines research and development: a survey of the general public in six European countries. BMJ Open. 2015;5(4):e006420. https://doi.org/10.1136/bmjopen-2014-006420.
Califf, RM, Cutler, TL, Marston, HD, Meeker-O’Connell, A. The importance of ClinicalTrials.Gov in informing trial design, conduct, and results. J Clin Transl Sci. 2025;9(1):e42. https://doi.org/10.1017/cts.2025.9.
Alqaidoom, Z, Nguyen, P, Awadh, M, Page, MJ. Impact of searching clinical trials registers in systematic reviews of pharmaceutical and non-pharmaceutical interventions: reanalysis of meta-analyses. Res Synth Methods. 2023;14(1):52–67. https://doi.org/10.1002/jrsm.1583.
Herold, R. ctrdata: retrieve and analyze clinical trials data from public registers. Published August 26, 2025. Accessed August 1, 2025. https://CRAN.R-project.org/package=ctrdata.
European Medicines Agency. Juvenile Animal Studies (JAS) and impact on anti-cancer medicine development and use in children. Published 2017. Accessed April 2, 2018. https://www.ema.europa.eu/en/juvenile-animal-studies-jas-impact-anti-cancer-medicine-development-use-children-scientific-guideline.
Lasch, F, Psarelli, EE, Herold, R, et al. The impact of COVID-19 on the initiation of clinical trials in Europe and the United States. Clin Pharmacol Ther. 2022;111(5):1093–1102. https://doi.org/10.1002/cpt.2534.
Tse, T, Fain, KM, Zarin, DA. How to avoid common problems when using ClinicalTrials.Gov in research: 10 issues to consider. BMJ. 2018;361:k1452. https://doi.org/10.1136/bmj.k1452.
International Committee of Medical Journal Editors. ICMJE: uniform requirements for manuscripts submitted to biomedical journals. Published April 2010. Accessed October 28, 2011. http://www.icmje.org/urm_main.html.
ClinicalTrials.gov. Data element-to-API field crosswalks. 2019. Accessed April 18, 2021. https://clinicaltrials.gov/api/gui/ref/crosswalks.
Clinical Data Interchange Standards Consortium. Clinical Trial Registry (CTR)-XML. 2016. Accessed July 3, 2016. https://www.cdisc.org/standards/data-exchange/ctr-xml.
ClinicalTrials.gov. PRS user’s guide. Published June 14, 2024. Accessed March 29, 2025. https://clinicaltrials.gov/submit-studies/prs-help/user-guide#section9.
EudraCT Secure Results Documentation Page. Published November 14, 2023. Accessed March 29, 2025. https://eudract.ema.europa.eu/result.html.
European Medicines Agency. CTIS - Clinical trials in the European Union. January 31, 2022. Accessed March 29, 2025. https://euclinicaltrials.eu/.
Ecma International. Standard ECMA-404 - the JSON data interchange syntax. Published December 2017. Accessed March 4, 2019. https://www.ecma-international.org/publications/standards/Ecma-404.htm.
R Core Team. R: A Language and Environment for Statistical Computing; 2020. https://www.R-project.org/.
Herold, R, Chamberlain, S, FitzJohn, R, Ooms, J. nodbi: “NoSQL” database connector. Published June 26, 2025. Accessed August 1, 2025. https://CRAN.R-project.org/package=nodbi.
MongoDB, Inc. Aggregation pipeline. 2021. Accessed April 9, 2025. https://www.mongodb.com/docs/manual/core/aggregation-pipeline/.
Herold, R. ctrdata: script to automatically copy user’s query from web browser. Published April 24, 2023. Accessed April 24, 2025. https://rfhb.github.io/ctrdata/#id_2-script-to-automatically-copy-users-query-from-web-browser.
International Council for Harmonisation. General principles for planning and design of multi-regional clinical trials E17. 2017. Accessed April 13, 2025. https://database.ich.org/sites/default/files/E17EWG_Step4_2017_1116.pdf.
Pétavy, F, Guizzaro, L, dos, RIA, Teerenstra, S, Roes, KCB. Beyond “intent-to-treat” and “per protocol”: improving assessment of treatment effects in clinical trials through the specification of an estimand. Br J Clin Pharmacol. 2020;86(7):1235–1239. https://doi.org/10.1111/bcp.14195.
Folgori, L, Bielicki, J, Ruiz, B, et al. Harmonisation in study design and outcomes in paediatric antibiotic clinical trials: a systematic review. Lancet Infect Dis. 2016;16(9):e178–e189. https://doi.org/10.1016/S1473-3099(16)00069-4.
FDA. openFDA - API - reference. Published June 19, 2016. Accessed June 22, 2016. https://open.fda.gov/api/reference/.
Kalinowski, T, Falbel, D. Ragnar: retrieval-augmented generation (RAG) workflows. 2025. https://ragnar.tidyverse.org/. https://doi.org/10.32614/CRAN.package.ragnar.
Sachs, M. sachsmc/rclinicaltrials. Published November 28, 2020. Accessed April 11, 2021. https://github.com/sachsmc/rclinicaltrials.
Chakraborty, I. clintrialx: connect and work with clinical trials data sources. Published March 11, 2025. Accessed April 5, 2025. https://cran.r-project.org/web/packages/clintrialx/index.html. https://doi.org/10.32614/CRAN.package.clintrialx.
Carlisle, BG. cthist: Clinical Trial Registry history. Published July 17, 2024. Accessed April 5, 2025. https://cran.r-project.org/web/packages/cthist/index.html.
EBM DataLab. ebmdatalab/clinicaltrials-act-tracker. Published December 10, 2020. Accessed April 11, 2021. https://github.com/ebmdatalab/clinicaltrials-act-tracker.
Clinical Research Metadata Repository. Accessed April 5, 2025. https://crmdr.ecrin.org/.
ecrin-github/DataHarvester. Published October 12, 2021. Accessed April 5, 2025. https://github.com/ecrin-github/DataHarvester.
Zhang, E, Zhang, WG, Zhang, RG. CRAN task view: clinical trial design, monitoring, and analysis. Published May 20, 2023. Accessed March 29, 2025. https://CRAN.R-project.org/view=ClinicalTrials.
König, F, Spiertz, C, Millar, D, et al. Current state-of-the-art and gaps in platform trials: 10 things you should know, insights from EU-PEARL. eClinicalMedicine. 2023;67:102384. https://doi.org/10.1016/j.eclinm.2023.102384.
ClinicalTrials.gov. Support and training materials. Published 2025. Accessed April 5, 2025. https://clinicaltrials.gov/submit-studies/prs-help/support-training-materials#example-studies.
EC / EMA / HMA. Complex clinical trials – questions and answers. 2022. Accessed June 12, 2024. https://health.ec.europa.eu/system/files/2022-06/medicinal_qa_complex_clinical-trials_en.pdf.
The Official Portal for European Data. Accessed April 5, 2025. https://data.europa.eu/en.
HMA-EMA. Catalogues of Real-World Data Sources and Studies. 2024. Accessed January 6, 2025. https://catalogues.ema.europa.eu/.
Clinical Data Publication. Accessed April 5, 2025. https://clinicaldata.ema.europa.eu/web/cdp/background.
Health Canada. Clinical information on drugs and health products. Published March 12, 2019. Accessed April 5, 2025. https://clinical-information.canada.ca/search/ci-rc.
European Medicines Agency. European public assessment reports: background and context. Published September 12, 2012. Accessed April 5, 2025. https://www.ema.europa.eu/en/medicines/what-we-publish-medicines-when/european-public-assessment-reports-background-context.
Accelerating clinical trials in the EU. Accessed April 5, 2025. https://accelerating-clinical-trials.europa.eu/index_en.
Barbier, L, Moscariello, P, Leufkens, HG, Herold, R, Pasmooij, AMG. A new European platform for advancing regulatory science research. Nat Rev Drug Discov. 2025;10. https://doi.org/10.1038/d41573-025-00024-y.
European Medicines Agency. Regulatory science research needs 2025 update. 2025. Accessed July 31, 2025. https://www.ema.europa.eu/en/documents/other/regulatory-science-research-needs-2025-update_en.pdf.
Figure 1 Overview on using R package ctrdata. The arrow means ‘makes use of’. The user can execute functions in package ctrdata that query clinical trial registers (CTRs) and load data of trials of interest. Such functions include ‘ctrGenerateQueries()’, ‘ctrLoadQueryIntoDb()’, and ‘ctrFindActiveSubstanceSynonyms()’. These functions make use of the application programming interfaces (APIs) and web interfaces of four CTRs. The user can inspect trials with ctrdata function ‘ctrShowOneTrial()’ and select fields with ‘dbFindFields()’. Package ctrdata uses package nodbi for storing clinical trial data in SQLite, DuckDB, MongoDB, or PostgreSQL. Data of interest are generated using ctrdata functions ‘dbGetFieldsIntoDf()’ (which extracts data from the database, combines data from different registers into a data set, and calculates concepts across trials), ‘dfTrials2Long()’, and ‘dfName2Value()’ (which reshape a data set and select nested fields based on field identifiers).

Table 1 Overview of registers supported by ctrdata (numbers rounded to three significant digits; XML, extensible markup language; JSON, JavaScript Object Notation; ‘other types of clinical studies’ refers to studies of medical devices, behavioural and other health interventions, observational, non-interventional, and other studies)

Table 2 Main functions in ctrdata in order of a potential workflow

Table 3 Overview on clinical trial concept functions implemented in ctrdata

Figure 2 Boxplot of sample size by phase of trial.

Figure 3 Cumulative density of reported p-values (dotted line, p = 0.05).
