
A Practical Approach to Minimizing Risk from Multiplicity in Statistical Reporting

Published online by Cambridge University Press:  08 January 2026

Jeffrey Michael Franc*
Affiliation:
Associate Professor, Department of Emergency Medicine, University of Alberta; Visiting Professor in Disaster Medicine, Università del Piemonte Orientale; Adjunct Faculty, Harvard/BIDMC Disaster Medicine Fellowship; Editor-in-Chief, Prehospital and Disaster Medicine
*
Correspondence: Jeffrey Michael Franc, Department of Emergency Medicine, 736c University Terrace, 8203-112 Street NW, Edmonton, AB, Canada, T6G 2T4. E-mail: jeffrey.franc@ualberta.ca

Abstract

Unfortunately, P value multiplicity continues to be a pervasive threat to statistical validity in medical research. Performing many hypothesis tests and treating each one as if it were the only hypothesis tested leads to a dramatic increase in the risk of false research claims. This editorial describes a simple method for authors to avoid P value multiplicity while improving clarity of the findings for the reader.

Information

Type
Editorial
Copyright
© The Author(s), 2026. Published by Cambridge University Press on behalf of World Association for Disaster and Emergency Medicine

Introduction

The issue of P value multiplicity is a subtle, yet significant, challenge frequently encountered in medical research. Understanding P values, and how to properly address multiple hypothesis tests, can be a source of confusion - even for some statisticians.

Authors often feel pressured to provide significant P values, fearing that without having a “positive study,” there is little chance of publication.

Furthermore, statistical software has made creating P values simple. Once the data are loaded into the software, it is easy to devise and implement a near limitless array of statistical tests. Authors are often tempted to keep hunting until they find a significant P value and then report this as the primary outcome of the study.

Unfortunately, such use of P values is both inappropriate and misleading.

What is a P Value, Anyway?

The P value is the probability, calculated assuming that the null hypothesis is true, of obtaining a value of the test statistic at least as contradictory to the null hypothesis as the value that actually resulted. The smaller the P value, the more contradictory are the data to the null hypothesis.[1]
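In symbols (the notation here is ours, chosen only for illustration), if T is the test statistic, t_obs is its observed value, and larger values of T are more contradictory to the null hypothesis H0, then:

    P = \Pr\left( T \geq t_{\mathrm{obs}} \mid H_{0}\ \text{is true} \right)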

For example, assume the null hypothesis is that a deck of cards is a standard 52-card deck. If one draws four cards at random, according to the hypergeometric distribution, there is a 495/270,725, or approximately 0.18%, chance of getting four face cards by chance. If we did so on our first try, it might very well lead us to reject the null hypothesis and conclude that the deck is not fair.
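For readers who wish to check the arithmetic, a standard deck contains 12 face cards, so the hypergeometric calculation is:

    P(\text{four face cards}) = \frac{\binom{12}{4}}{\binom{52}{4}} = \frac{495}{270{,}725} \approx 0.0018 \quad (\text{about } 0.18\%)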

However, we note that the chance is not zero. And we can surmise that if we were to keep trying repeatedly, even if the deck is fair, eventually we would get four face cards and falsely reject the null hypothesis. Falsely rejecting the null hypothesis when in fact the null hypothesis is true is called a Type I error.

Building a Better Dart

Imagine you’ve invented a brand-new kind of dart, and you’re convinced it’s far more accurate than any dart you’ve used before. Since you know from your experience that your usual accuracy of hitting a bullseye is about 10%, you set up the following null and alternative hypotheses:

  • H0: p = 0.1, where p is the probability of hitting a bullseye;

  • HA: p > 0.1.

To test your hypothesis, you devise a simple statistical test. You throw the new dart at the board five times. If it lands right in the center at least twice, you declare that it is likely more accurate than the usual darts you’ve used. In fact, according to the binomial distribution, there is only about an eight percent chance of scoring two or more bullseyes in five throws if the accuracy is truly 10%. Thus, the probability of committing a Type I error, and falsely rejecting the null hypothesis, is approximately eight percent.
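The binomial calculation behind that eight percent figure, shown here for completeness, is:

    P(X \geq 2) = 1 - \binom{5}{0}(0.1)^{0}(0.9)^{5} - \binom{5}{1}(0.1)^{1}(0.9)^{4} = 1 - 0.5905 - 0.3281 \approx 0.081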

Now, let’s consider a different scenario. Instead of just one dart, you test 25 different darts, each with its own unique color. You throw each dart five times to see which color dart hits two or more bullseyes in the five trials. Eventually, one of them is bound to succeed just by chance. You pick that lucky dart—the one that performed best out of the 25—and you announce, “This color of dart is the most accurate of them all.” Of course, you’ve now multiplied your chances of finding at least one dart that passes your test purely by coincidence. In fact, even if none of the darts are better than your usual darts, there is an 88% chance (again by the binomial distribution) that at least one of the 25 darts will get two or more bullseyes in five throws. Rephrased, the probability of committing a Type I error and falsely rejecting the null hypothesis is approximately 88%.
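Both figures are easy to reproduce. A minimal sketch in Python (assuming SciPy is available; the numbers simply restate the dart example above) is:

    from scipy.stats import binom

    p_bullseye = 0.10   # accuracy assumed under the null hypothesis
    n_throws = 5
    n_darts = 25

    # Probability that a single dart scores two or more bullseyes in five throws
    p_single = binom.sf(1, n_throws, p_bullseye)   # sf(1) = P(X >= 2), about 0.081

    # Probability that at least one of 25 independent darts does so by chance alone
    p_any = 1 - (1 - p_single) ** n_darts          # about 0.88

    print(f"One dart passes by chance: {p_single:.3f}")
    print(f"At least one of 25 darts passes by chance: {p_any:.3f}")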

This analogy underscores the problem of P value multiplicity. When multiple statistical tests are conducted, the probability of obtaining at least one false positive result increases. Performing many hypothesis tests and then reporting only those that are positive is commonly called P value hacking, or P-hacking. Unfortunately, P-hacking remains widespread. In a 2012 survey of over 2,000 US psychologists, 50% admitted to selectively reporting only studies that had produced significant results.[2] Additionally, 58% acknowledged looking at preliminary results before deciding whether to continue data collection, 43% reported discarding data after seeing its effect on the P value, and 35% admitted to similar questionable research practices.[2] Along with other statistical violations, P value hacking is worsening the crisis of reproducibility in science.[3]

Multiplicity can sneak into a study in several ways. For instance, some studies have many outcomes, test multiple P values, and then report only the significant ones as important findings of the study. Sometimes researchers publish all P values; sometimes the reader is not even aware of how many hypothesis tests have been done. Sometimes multiple tests are performed in the same study: for example, testing a placebo against four different doses of a drug approximately quadruples the probability of committing a Type I error. Another common area where multiplicity becomes problematic is in large regression equations, where authors test many potentially important factors with multiple P values without accounting for this multiplicity.
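More generally, if k independent tests are each performed at significance level α, the chance of at least one false positive (the family-wise error rate) is:

    \text{FWER} = 1 - (1 - \alpha)^{k}

so four comparisons at α = .05 give 1 - (0.95)^4 ≈ 0.19, roughly four times the nominal five percent.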

To maintain high-quality research, it is essential for researchers to account for and control the number of statistical comparisons made.

Dealing with Multiplicity

How can we address this issue? One method involves statistical adjustments, such as the Bonferroni correction, where the significance level is lowered to account for multiple tests. The Bonferroni method is a straightforward approach in which the significance level (alpha) is divided by the number of statistical tests, adjusting the threshold for each individual P value to provide strong control of Type I errors while remaining simple and easy to interpret.
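As a brief illustration (the P values below are invented solely for demonstration), the Bonferroni correction amounts to only a few lines of code:

    # Bonferroni correction: divide alpha by the number of tests performed.
    # The P values here are hypothetical and used only for illustration.
    alpha = 0.05
    p_values = [0.012, 0.030, 0.004, 0.250, 0.041]

    adjusted_alpha = alpha / len(p_values)   # 0.05 / 5 = 0.01
    for i, p in enumerate(p_values, start=1):
        verdict = "significant" if p < adjusted_alpha else "not significant"
        print(f"Test {i}: P = {p:.3f} is {verdict} at the adjusted alpha of {adjusted_alpha:.3f}")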

Many alternative methods exist for adjusting for multiple hypothesis testing, including Tukey’s method, Scheffé’s method, Fisher’s Least Significant Difference (LSD) method, and Hsu’s method. However, these approaches are often more complex to apply correctly and may be even more challenging for readers to interpret accurately. Although we will not delve deeply into these methods here, readers may refer to standard texts, such as Hsu’s book on multiple comparisons, for a thorough discussion of these statistical approaches.[4]

Consulting a statistician early in the study design process is always an excellent choice.

A Simple Approach to Combating Multiplicity

A simpler, and often more effective, way to handle multiplicity is to structure the paper around one single primary hypothesis and reserve formal statistical testing (P values) for that main objective. Secondary and exploratory outcomes are presented as unadjusted confidence intervals without P values. These outcomes are treated as hypothesis generating, not hypothesis confirming.

This approach is especially useful in smaller studies such as those usually performed in Prehospital and Disaster Medicine.

The steps are as follows:

  1. Declare a Single Primary Objective: In the Introduction, clearly state the single primary objective and the primary hypothesis that will be formally tested.

  2. Describe the Primary Outcome: In the Outcome Measures subsection of the Methods, clearly describe the primary outcome.

  3. Describe the Secondary Outcomes: Still in the Outcome Measures subsection, list any pre-specified secondary outcomes.

  4. Specify the Single Test Statistic: In the Data Analysis subsection of the Methods, detail the exact statistical test used to evaluate the primary outcome and the level of significance chosen for the study. Even if the primary outcome involves comparisons across multiple groups or subgroups, researchers should still aim to use a single, omnibus statistical test to assess overall significance (a brief hypothetical sketch of this approach appears after this list). Consulting a statistician is strongly recommended for researchers who have difficulty identifying a single best test.

  5. Outline Secondary and Exploratory Analyses: Still in the Data Analysis subsection, list any test statistics for secondary or exploratory outcomes. Add the text “The widths of the 95% confidence intervals have not been adjusted for multiplicity, and the intervals should not be used in place of hypothesis testing.”

  6. Report Primary Results Clearly: In the Results section, clearly present the results of the primary hypothesis test, usually in its own paragraph or under the subsection Primary Outcome. Include the P value and a 95% confidence interval.

  7. Differentiate Secondary and Exploratory Results: Also in the Results, label secondary and exploratory outcomes clearly with a subtitle or separate paragraph. Include the unadjusted 95% confidence intervals, but do not include any P values.

  8. Add a Table Footnote: In any tables of secondary or exploratory outcomes, add the footnote “The widths of 95% confidence intervals have not been adjusted for multiplicity, and they should not be used in place of hypothesis testing.”

  9. Address the Primary Outcome: In the Discussion section, usually in the first paragraph, address the primary outcome. Ideally, explicitly state whether the primary hypothesis was rejected. Include interpretation of the confidence interval; this is particularly important if the P value is not significant.

  10. Discuss with Caution: When presenting secondary or exploratory outcomes in the Discussion, avoid language that implies definitive conclusions. Instead, these findings should be framed as preliminary and intended primarily for hypothesis generation, highlighting the need for future studies to formally test these observations.
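The sketch referenced in Step 4 follows. All data are invented, and the one-way analysis of variance is only a stand-in for whatever single omnibus test best fits a given design; the point is that the primary outcome receives one P value, while a secondary outcome is reported as an unadjusted confidence interval without one.

    # Hypothetical illustration of Steps 4, 5, and 7. All data are simulated.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(59)

    # Primary outcome measured in a placebo group and three dose groups
    placebo = rng.normal(10.0, 2.0, size=30)
    dose_a = rng.normal(10.5, 2.0, size=30)
    dose_b = rng.normal(11.0, 2.0, size=30)
    dose_c = rng.normal(11.5, 2.0, size=30)

    # A single omnibus test (one-way ANOVA) yields one P value for the primary hypothesis
    f_stat, p_value = stats.f_oneway(placebo, dose_a, dose_b, dose_c)
    print(f"Primary outcome: F = {f_stat:.2f}, P = {p_value:.3f}")

    # A secondary outcome is summarized by an unadjusted 95% CI (normal approximation), no P value
    treated = rng.normal(4.8, 1.5, size=60)
    control = rng.normal(5.2, 1.5, size=60)
    diff = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / treated.size + control.var(ddof=1) / control.size)
    print(f"Secondary outcome difference: {diff:.2f} "
          f"(unadjusted 95% CI, {diff - 1.96 * se:.2f} to {diff + 1.96 * se:.2f})")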

By following this structured approach, researchers can minimize the risks associated with P value multiplicity and produce more robust and reliable scientific conclusions.

An Example of Best Practice in P Value Interpretation

An excellent example of careful and transparent use of P values is found in the randomized trial by Geraghty, et al, published in the New England Journal of Medicine, entitled “Video versus Direct Laryngoscopy for Urgent Intubation of Newborn Infants.”[5]

In the final sentence of the Introduction, the authors clearly state the primary objective: to determine whether the use of a video laryngoscope improves first-attempt success rates during urgent oral endotracheal intubation in neonatal units. The methods section precisely defines the primary outcome as successful intubation on the first attempt, confirmed by exhaled carbon dioxide detection. The statistical analysis section states that this outcome was compared using a chi-squared test, with a two-sided P value of <.05 considered statistically significant.

Importantly, the authors also disclose that categorical and continuous safety and secondary outcomes were summarized as percentages and medians, respectively, and that the 95% confidence intervals were not adjusted for multiplicity—explicitly cautioning that these intervals should not be interpreted as formal hypothesis tests. In the Results, the primary outcome is clearly labeled and includes both the P value and the corresponding 95% confidence interval. Secondary outcomes are transparently separated by a subtitle, avoiding any confusion. Finally, tables throughout the manuscript include a footnote indicating that subgroup analyses were post hoc and that the widths of confidence intervals were unadjusted for multiplicity and should not substitute for formal hypothesis testing. In total, the paper presents only one P value.

Of note, the authors also do an excellent job in presenting the differences between the two randomized groups—typically displayed in Table 1 of any randomized trial—without using inferential statistical tests. This aligns with well-established best practices, which discourage hypothesis testing for baseline characteristics in randomized studies.[6] Instead, descriptive statistics alone are used to highlight any imbalances.

Moving Forward

At Prehospital and Disaster Medicine, our primary goal is to publish innovative, high-impact, evidence-based research. Importantly, this goal does not mandate that all published papers have a significant P value. Rather, editors at PDM are far more likely to be impressed by a solid study design and controlled use of hypothesis testing than by a very small P value. The journal frequently publishes studies with non-significant P values provided that the overall study is of high quality.

Unfortunately, some authors submit innovative and potentially high-impact manuscripts containing many P values without consideration of multiplicity. If the editors feel that the study warrants publication, the PDM team will work directly with these authors, using a framework such as the 10-step process above, to revise these papers to meet PDM quality standards.

Author Contributions

JMF is responsible for all content.

Use of AI Technology: ChatGPT Version 5 (OpenAI; San Francisco, California USA) was used for assistance with writing; however, the author remains responsible for all content.

Conflicts of interest

JMF is the CEO and founder of STAT59 and the Editor-in-Chief of Prehospital and Disaster Medicine.

References

1. Devore JL. Probability and Statistics for Engineering and the Sciences. 7th ed. Boston, Massachusetts USA: Thomson/Brooks/Cole, Cengage Learning; 2008.
2. Nuzzo R. How scientists fool themselves – and how they can stop. Nature. 2015;526(7572):182-185. doi:10.1038/526182a.
3. Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2(8):e124. doi:10.1371/journal.pmed.0020124.
4. Hsu JC. Multiple Comparisons: Theory and Methods. Abingdon, United Kingdom: Taylor & Francis; 1996.
5. Geraghty LE, Dunne EA, Ní Chathasaigh CM, et al. Video versus direct laryngoscopy for urgent intubation of newborn infants. N Engl J Med. 2024;390(20):1885-1894. doi:10.1056/NEJMoa2402785.
6. Hayes-Larson E, Kezios KL, Mooney SJ, Lovasi G. Who is in this study, anyway? Guidelines for a useful Table 1. J Clin Epidemiol. 2019;114:125-132. doi:10.1016/j.jclinepi.2019.06.011.