Introduction
The Institute of Medicine (IOM) framework classifies mental health interventions into different categories, including promotion, prevention and treatment (Institute of Medicine, 2009). Numerous randomized controlled trials (RCTs) and systematic reviews have assessed the efficacy and effectiveness of promotion, prevention and treatment interventions for mental health across diverse populations (Barbui et al., Reference Barbui, Purgato, Abdulmalik, Acarturk, Eaton, Gastaldon, Gureje, Hanlon, Jordans, Lund, Nose, Ostuzzi, Papola, Tedeschi, Tol, Turrini, Patel and Thornicroft2020; Hoare et al., Reference Hoare, Collins, Marx, Callaly, Moxham-Smith, Cuijpers, Holte, Nierenberg, Reavley, Christensen, Reynolds, Carvalho, Jacka and Berk2021; Papola et al., Reference Papola, Purgato, Gastaldon, Bovo, van Ommeren, Barbui and Tol2020; Purgato et al., Reference Purgato, Albanese, Papola, Prina, Tedeschi, Gross, Sijbrandij, Acarturk, Annoni, Silva, Jordans, Lund, Tol, Cuijpers and Barbui2023; van Zoonen et al., Reference van Zoonen, Buntrock, Ebert, Smit, Reynolds, Beekman and Cuijpers2014).
Promotion focuses on empowering individuals to enhance their well-being and resilience, and can be a distinct strategy or part of prevention and/or treatment efforts (Eaton, Reference Eaton2019). Prevention aims to avert, reduce or delay the onset of mental disorders through universal, selective or indicated approaches. Prevention may involve early detection and diagnosis, targeting risk and protective factors and reducing the impact of disease on functionality and quality of life (e.g., relapse prevention) (Acarturk et al., Reference Acarturk, Uygun, Ilkkursun, Carswell, Tedeschi, Batu, Eskici, Kurt, Anttila, Au, Baumgartner, Churchill, Cuijpers, Becker, Koesters, Lantta, Nose, Ostuzzi, Popa, Purgato, Sijbrandij, Turrini, Valimaki, Walker, Wancata, Zanini, White, van Ommeren and Barbui2022; Augustinavicius et al., Reference Augustinavicius, Purgato, Tedeschi, Musci, Leku, Carswell, Lakin, van Ommeren, Cuijpers, Sijbrandij, Karyotaki, Tol and Barbui2023; Buntrock et al., Reference Buntrock, Berking, Smit, Lehr, Nobis, Riper, Cuijpers and Ebert2017; Purgato et al., Reference Purgato, Singh, Acarturk and Cuijpers2021a, Reference Purgato, Carswell, Tedeschi, Acarturk, Anttila, Au, Bajbouj, Baumgartner, Biondi, Churchill, Cuijpers, Koesters, Gastaldon, Ilkkursun, Lantta, Nosè, Ostuzzi, Papola, Popa, Roselli, Sijbrandij, Tarsitani, Turrini, Välimäki, Walker, Wancata, Zanini, White, van Ommeren and Barbui2021b; Riello et al., Reference Riello, Purgato, Bove, Tedeschi, MacTaggart, Barbui and Rusconi2021; Purgato et al., Reference Purgato, Tedeschi, Riello, Zaccoletti, Mediavilla, Ayuso-Mateos, MacTaggart, Barbui and Rusconi2025a, Reference Purgato, Tedeschi, Turrini, Cadorin, Compri, Muriago, Ostuzzi, Pinucci, Prina, Serra, Tarsitani, Witteveen, Roversi, Melchior, McDaid, Park, Petri-Romao, Kalisch, Underhill, Bryant, Mediavilla Torres, Ayuso-Mateos, Felez Nobrega, Haro, Sijbrandij, Nose and Barbui2025b). Treatment targets individuals with existing mental disorders, aiming to alleviate symptoms and improve functioning (Eaton, Reference Eaton2019; Tol et al., Reference Tol, Purgato, Bass, Galappatti and Eaton2015).
However, distinguishing between promotion, prevention and (early) treatment trials remains challenging for at least four reasons (Cuijpers, Reference Cuijpers2022). First, prevention studies require outcome measures that entail a clear ascertainment of the onset of a new disorder in adequately sized samples. However, this is lengthy, resource-intensive and often unfeasible, especially in low-resource settings. Many trials rely on proxy outcomes of incidence, such as symptom worsening, which can obscure the true effectiveness of prevention efforts and conceptually overlap with how treatments are commonly evaluated in RCTs. Second, binary classification of mental health is challenging and may oversimplify its complexity, as mental health exists more on a continuum rather than a strict healthy/ill dichotomy (Papola and Patel, Reference Papola and Patel2025; Patel et al., Reference Patel, Saxena, Lund, Thornicroft, Baingana, Bolton, Chisholm, Collins, Cooper, Eaton, Herrman, Herzallah, Huang, Jordans, Kleinman, Medina-Mora, Morgan, Niaz, Omigbodun, Prince, Rahman, Saraceno, Sarkar, De Silva, Singh, Stein, Sunkel and Unützer2018). The chosen outcome measure may not necessarily be informative about whether the preventive potential of an intervention was investigated in the study or not. Third, studies may simultaneously focus on promotion, prevention and treatment. For several mental health and psychosocial support (MHPSS) interventions tested in RCTs, the boundaries between promotion, prevention and treatment are blurred, as all of them may be represented to different degrees. Moreover, the contents, components, levels, actors and possible mechanisms of action of MHPSS are only seldom explicitly described in many RCTs (Papola et al., Reference Papola, Prina, Ceccarelli, Cadorin, Gastaldon, Ferreira, Tol, van Ommeren, Barbui and Purgato2024; Purgato et al., Reference Purgato, Singh, Acarturk and Cuijpers2021a). This implies issues related to the proverbial black box, poor unpacking of the interventions’ components, and no explicit reference to an a priori theory of change (Miller et al., Reference Miller, Jordans, Tol and Galappatti2021). Intervention manuals are often not reported or lack details on whether the intervention being tested was designed for promotion, prevention or treatment (Cuijpers et al., Reference Cuijpers, Boyce and van Ommeren2024; Watts et al., Reference Watts, van Ommeren and Cuijpers2020). Fourth, having clarity on the aim of an intervention being evaluated is important for future use and implementation. If the distinction between promotion, prevention and treatment is diffuse, it might reflect uncertainty about the ultimate purpose of the intervention, and thereby its future use. Fifth, the reporting of many RCTs may be suboptimal, limiting appraisal of internal and external validity, and, therefore, inferential interpretation of findings. Sampling procedures are often poorly reported, and the inclusion and exclusion criteria of participants are often unclear (Miller et al., Reference Miller, Rasmussen and Jordans2023). This makes it difficult to draw coherent lines between populations at risk of and with mental disorders. For example, the same intervention given to the former population may be conceived as preventive, while for the latter as treatment.
Promotion, prevention and treatment RCTs should be clearly discerned to facilitate a better understanding of research findings in the field of public mental health. Appropriately appraising RCTs not only facilitates better overall evidence synthesis but also serves as a crucial tool for enhancing knowledge translation and policy uptake. A clear distinction can finally help researchers optimize the choice of outcomes, pinpoint research gaps, allocate efforts and resources where needed most, and avoid redundancies.
Against this background, we set out to develop a new measurement tool – the VErona-LUgano Tool (VELUT) – designed to position RCTs of MHPSS interventions along the promotion-to-treatment continuum through critical appraisal. This paper describes the methodological process, statistical analyses and results involved in the development and preliminary validation of this tool. It also provides the tool itself, along with instructions on how to use it in evidence synthesis.
Methods
The protocol for this study was approved by the Ethics Committee of Università della Svizzera Italiana (CE_2024_04 of 25-01-2024), and has been published (Purgato et al., Reference Purgato, Albanese, Papola, Prina, Tedeschi, Gross, Sijbrandij, Acarturk, Annoni, Silva, Jordans, Lund, Tol, Cuijpers and Barbui2023).
The methods, processes and procedures described here are adapted from the practical guide for the development of health measurement scales (David and Norman, Reference David and Norman2015), which we complemented with elements of the Child Health and Nutrition Research Initiative research prioritization method (CHNRI) (Rudan, Reference Rudan2016; Rudan et al., Reference Rudan, Yoshida, Chan, Sridhar, Wazny, Nair, Sheikh, Tomlinson, Lawn, Bhutta, Bahl, Chopra, Campbell, El Arifeen, Black and Cousens2017; Shah et al., Reference Shah, Albanese, Duggan, Rudan, Langa, Carrillo, Chan, Joanette, Prince, Rossor, Saxena, Snyder, Sperling, Varghese, Wang, Wortmann and Dua2016). The VELUT is an outcome-based tool composed of questions with appropriate answer options. The items were devised, drafted, selected and tested in multiple rounds.
The development of VELUT began with a conceptualization step, in which we established a tool development group (TDG) composed of international experts (n = 14) in public mental health, global mental health and complementary disciplines, i.e., statistics and epidemiology. The TDG members are from 10 universities in 9 countries and are all co-authors of the present paper. The two coordinators of the TDG are the first co-authors of this paper (M.P. and E.A.) and worked in close collaboration with the last author, who originally conceived this endeavour (C.B.). Figure 1 presents the flowchart of the VELUT development methodological steps, described in detail below.

Figure 1. Study flowchart.
Devising items
The TDG adopted an iterative process to define the construct(s) targeted for assessment and agreed to use both the Population, Intervention, Comparison and Outcome (PICO) framework and the IOM mental disorders preventive intervention research framework for devising, reviewing and selecting items (Institute of Medicine, 2009; Richardson et al., Reference Richardson, Wilson, Nishikawa and Hayward1995).
The TDG generated an initial pool of items related to, and collectively reflecting, the conceptualized constructs. We used two complementary approaches for item generation. First, we systematically searched for existing measures already developed to position RCTs along the promotion-to-treatment continuum; we found no studies reporting similar tools in any health field (Purgato et al., Reference Purgato, Albanese, Papola, Prina, Tedeschi, Gross, Sijbrandij, Acarturk, Annoni, Silva, Jordans, Lund, Tol, Cuijpers and Barbui2023). Second, the TDG integrated multiple sources to develop the items, including theoretical frameworks, research expertise in the design and conduction of RCTs of MHPSS interventions, and collaboration with practitioners. Qualitative methods were employed with the TDG to elicit additional themes and insights regarding the usability and relevance of our tool. The qualitative methods included focus groups with TDG members, the tool’s intended end users, and three interview rounds focused on each individual item, its intended meaning, and exploration of its relevance and understanding. We also identified and discussed key methodological features of RCTs relevant to assessing the study’s promotion, prevention or treatment nature.
Selection of items
First, TDG members categorized all devised items according to the PICO elements, carefully considering the potential of each item to inform the promotion, prevention or treatment nature of the RCTs domains according to the IOM framework (Institute of Medicine, 2009). Second, the two TDG coordinators (M.P., a clinical psychologist, and E.A., an epidemiologist) applied techniques adapted from the CHNRI research prioritization methodology (Rudan, Reference Rudan2016; Rudan et al., Reference Rudan, Yoshida, Chan, Sridhar, Wazny, Nair, Sheikh, Tomlinson, Lawn, Bhutta, Bahl, Chopra, Campbell, El Arifeen, Black and Cousens2017) to consolidate, combine and remove redundant items, maintaining an overall balance between the construct’s granularity and overall salient features. Third, M.P., E.A. and C.B. discussed and refined the wording and clarity of items. TDG members first participated in survey rounds to provide quantitative ratings on each item and their pertinence concerning the measured constructs (1 = not pertinent at all to 5 = extremely pertinent) and clarity (1 = extremely unclear to 5 = extremely clear). Fourth, TDG members participated in cognitive interviews to ensure consistency in the interpretation of the items (and response options) and to discuss face and content validity aspects. Finally, item wording and phrasing were improved through a consolidation step based on iterative discussion in a dedicated online session of TDG members, and independently revised by an external English mother-tongue for grammar and lexicon. The ensuing set of items formed the alpha version of the VELUT, which was used to assess a selection of 180 RCTs from recently published systematic reviews to support further item selection through IRT modelling (below).
Statistical analysis
Data collection
A team of seven evaluators (Appendix, page 8) used the tool to examine its application in a database of published RCTs on the effectiveness of MHPSS interventions (Papola et al., Reference Papola, Purgato, Gastaldon, Bovo, van Ommeren, Barbui and Tol2020; Purgato et al., Reference Purgato, Prina, Ceccarelli, Cadorin, Abdulmalik, Amaddeo, Arcari, Churchill, Jordans, Lund, Papola, Uphoff, van Ginneken, Tol and Barbui2023), which had been previously searched, selected and stored in a repository at the University of Verona.
Two assessors independently evaluated all 180 RCTs with the 16-item, alpha version of the VELUT, for the item response theory (IRT) modelling (described below) (MacCallum et al., Reference MacCallum, Browne and Cai2006; van der Linden, Reference van der Linden2018; van der Linden and Hambleton, Reference van der Linden and Hambleton1997). Next, we computed an overall score for each primary study based on the data obtained with the application of the VELUT, and then we prepared lists of deleted/combined/rephrased items – informed by both the empirical application of the VELUT and according to IRT results – for further discussion with the TDG.
Throughout the research process, the TDG engaged in methodological discussions online and during a 2-day in-person workshop in Lugano (Switzerland), including using an additional method to categorize RCTs.
Statistical analyses
We performed a formal psychometric evaluation to assess the VELUT’s reliability and validity as described below in this paragraph. This included an internal consistency analysis and the factor structure of the VELUT using IRT. Using the standardized version of Cronbach’s alpha (Guttman’s lambda-6), we assessed the internal consistency of the scale, the average inter-item correlation and the signal-to-noise ratio (Albanese et al., Reference Albanese, Butikofer, Armijo-Olivo, Ha and Egger2020), and interpreted IRT model results to identify redundant or overlapping items.
Item response theory
We explored potentially uninformative items by measuring the response clustering (i.e., response or item non-variance ≥60%). We used confirmatory factor analysis (CFA), consistent with IRT, to verify whether the items fit the anticipated domain structure of the VELUT (i.e., PICO and IOM framework, above). Graded response (IRT) models for non-binary answer options were applied to relate the responses to our underlying construct of interest. This approach enabled us to explore characteristics of items, including their correlation with other items (e.g., discrimination) and relative location along the promotion-to-treatment continuum (e.g., item locations or thresholds). To summarize item loadings and thresholds, we generated Item Characteristic Curves (ICCs) to display the probability of endorsing a given response as a function of the latent trait level (i.e., the promotion-to-treatment continuum, on the x-axis), modelled with a cumulative logistic distribution. The ICCs visually summarize item location and discrimination parameters (Orlando et al., Reference Orlando, Sherbourne and Thissen2000; van der Linden and Hambleton, Reference van der Linden and Hambleton1997).
We used graded response logistic models (GRM) (Gross et al., Reference Gross, Sherva, Mukherjee, Newhouse, Kauwe, Munsie, Waterston, Bennett, Jones, Green and Crane2014), allowing both location and discrimination to vary across items, with the slope of the ICC depicting the ability to distinguish between neighbouring levels of the latent trait (David and Norman, Reference David and Norman2015). We used the multidimensional IRT methods, specifically bifactor modelling (Reise, Reference Reise2012) to capture possible departures from unidimensionality of the VELUT scale, and we explored goodness of fit of these models with Akaike information criteria, the Bayesian information criteria (Maydeu-Olivares and Joe, Reference Maydeu-Olivares and Joe2006), and the M2 statistics. The latter, which follows an approximate chi-square distribution, was purposely designed to work well with categorical item data (Maydeu-Olivares and Joe, Reference Maydeu-Olivares and Joe2005). M2 is less sensitive to sample size, it is computationally efficient, and it performs well even in the case of sparse data (i.e., a high number of response patterns). Our analysis of item fit was evaluated using normalized residuals (Bollen, Reference Bollen1989). Global model fit of IRT models was evaluated with the root mean square error of approximation (RMSEA), Standardized Root Mean Square Residual (SRMR) and Comparative Fit Index (CFI).
For all items, we calculated the item information function, which refers to the precision of the item in measuring the latent trait. Item information sums to a Test Information Function, which depicts the combined coverage and precision of the scale items relative to the promotion-to-treatment continuum (Lord, Reference Lord1980). We statistically tested whether unidimensionality and local/conditional independence assumptions were met using parallel analysis with scree plots. Items that violated these assumptions were flagged for potential exclusion and discussed by the TDG. In addition, we evaluated the number of dimensions covered.
Finally, we used IRT for a preliminary validation study of the final version of the VELUT with 36 RCTs, reusing a subset of the 180 trials, that is, the 20% at each decile between the 10th and 90th percentiles of the original distribution of the latent trait. We applied the IRT GRM to evaluate the VELUT’s psychometric properties and generated ICCs, threshold distribution plots (TDPs) and test characteristic curves (TCCs). ICCs and TDPs provide a visualization of the location and discrimination parameters, the latent trait range covered by the items, and the response options thresholds. TCCs illustrate the precision of the VELUT in estimating the underlying latent trait (theta) at different points along the trait continuum, and how well the scale captures the latent trait (theta) across RCTs, with Marginal Reliability thresholds ≥0.90 interpreted as indicative of excellent reliability (Roth andTannenbaum, Reference Roth and Tannenbaum2004). Analyses were carried out using Stata 15 (2017).
Results
Devising items
The TDG collaboratively generated an initial pool of 18 items and progressively added 15 more, integrating multiple sources reflecting the PICO and IOM frameworks. The comprehensiveness (i.e., content validity) of the provisional list of 33 items was confirmed by achieving thematic saturation for the key design and conduct features of RCTs that may provide information on the relevant aspects of the IOM construct.
Selection of items and IRT results for the preliminary version of the VELUT
Of the 33 items developed, 6 were allocated to Population, 18 to Intervention, none to Comparison, five to Outcome in the PICO framework, and four to Study Setting (e.g., humanitarian, community). The items were evenly distributed across the IOM promotion, selective and indicated prevention and treatment continuum. The TDG members discussed the item list, eliminating redundant items (n = 3), rephrasing items (n = 17) and consolidating similar ones to maintain a balance between granularity and construct representation. We retained 16 items and improved their wording and clarity for the next steps.
The average ratings for pertinence and clarity of the 16 items, assessed by members of the TDG using a 1–5 Likert scale, were 3.76 (0.46, range: 2.85–4.37) and 3.75 (0.41, range: 3.16–4.67), respectively. Cognitive interviews confirmed that all TDG members understood all items and response options as intended and could interpret them correctly. Minor wording adaptations were made, such as the use of the term ‘screening’ instead of ‘selection’ for item 1, the use of the term ‘primarily’ instead of ‘initially’, the elimination of redundant words (i.e., ‘for example’), and the choice and clarification of abbreviations. Distributions of response options for the rated 180 RCTs based on the preliminary version of the VELUT are reported in the appendix (Appendix Table e1). Items u1 and u2; u2 and u3; u13, u14 and u16 were discussed with the TDG as their implementation was challenging according to the reviewers during the RCT assessment. The correlation matrix confirmed a strong correlation between items u1 and u2; u2 and u3; and between u13, u14 and u16. As a result, and after the conceptual in-person TDG discussion, some items were rephrased, one item (‘Were subgroup analyses done based on risk profiles and/or symptoms?’) was moved to the beginning of the tool to provide contextual information about the RCT being rated instead of providing information along the promotion-to-treatment continuum, and Item u16 (‘In this specific RCT, was the trial setting humanitarian?’) was deleted. Some other items were combined. Principal component analysis and CFAs, respectively, indicated one strong factor explained 57% of the total variance and 84% of the shared inter-item variance, while parallel analysis with a scree plot suggested strong statistical evidence for unidimensionality. Cronbach’s α of 0.94 suggested strong internal consistency reliability. The very poor fit of a Rasch model (RMSEA = 0.329; CFI = 0.92; SRMR = 0.313) requires better model specification. A 2-p graded-response IRT had better fit to data (RMSEA = 0.158; CFI = 0.98; SRMR = 0.076), and was much improved with a bifactor solution (RMSEA = 0.100; CFI = 0.99; SRMR = 0.059) that allowed specific factors to explain intercorrelations between p1, p2, p3 and o1, o2 in addition to common covariation explained by the common factor. The ICCs of the 16 items showed high heterogeneity of both item difficulty and slope (i.e., discrimination parameter) (Appendix Figure e4), with partially overlapping thresholds (u13 and u14) likely due to low or no endorsement of given response categories (Appendix Figure e5). The ICCs highlighted potentially redundant items (u6 and u11, which were highly correlated, r = 0.96) (Appendix Figure e6).
Further item selection led to minimal depreciation of the test information plot (Appendix Figure e7), and through intensive, iterative discussion during the in-person workshop, the TDG agreed that eight items were either duplicates, not informative or not applicable. Six were deleted, 2 combined and 1 was moved to the beginning of the VELUT and transformed into a classification question, not contributing to the overall score. An agreement was reached about the remaining eight items, their phrasing and respective response options, and a recognition to allow missing data on items in the form of missing/unclear information from the published RCT.
Validation study and final version of the VELUT
Table 1 reports the frequencies of responses to the eight items of the final version of the VELUT applied to 36 RCTs independently selected by the TDG psychiatric epidemiologist (AG). To prioritize representation of trials along the full promotion-to-treatment continuum in this restricted set of 36 RCTs, we selected 20% of trials at each decile between the 10th and 90th percentiles of the original distribution of 180 trials. This selection of trials ensured a decent spread of responses across all items, with the expected exception of the two outcome-related items (O1 and O2) for which answer options of indicated prevention prevailed. There were 12 ‘unclear’ responses out of 36 to item P3 regarding participants’ symptomatology at baseline, which was treated as missing data in graded-response IRT models and handled assuming missingness was random conditional on other variables in the model (e.g., MAR). All item discriminations (estimated via weighted least squares means and variance adjusted estimation) were high (>0.58) to very high (>0.93), indicating a high correlation between the VELUT items and the underlying latent promotion-to-treatment continuum of interest.
Table 1. Frequencies of responses from the application of the VELUT to 36 selected RCTs, and item parameters

Loadings: The strength of the relationship between an item and the latent trait. Threshold: Points on the latent trait (θ) where the probability of selecting a specific response category or higher is 50%. WLSMV: weighted least squares means and variance adjusted for categorical data, such as Likert-scale responses.
a Thresholds are on the latent trait scale, not on the observed scale.
The thresholds between response options (i.e., category boundaries) across all items ranged between corresponding theta values of −2.86 to 2.48, with most intermediate thresholds falling close to 0 on the latent trait continuum (assumed to be standard normally distributed [N{0,1}]) (Table 1). Figure 2 presents the ICC of the final version of the VELUT. The steepness of slopes (i.e., discrimination) of the ICC shown in the figure supports the ability of each item to distinguish between neighbouring levels of the underlying latent trait, the promotion-to-treatment continuum. ICCs show the broad range of location parameters across items.

Figure 2. Item characteristic curves.
The information and reliability of each item and the overall VELUT scale are illustrated in Figures 3 and 4, respectively.

Figure 3. Item information function (Mplus).

Figure 4. Test information function (Mplus).
The item information functions (Fig. 3) show high information for all items across a broad range of theta, except for O1 and O2. The test information function suggests that the VELUT is a strong measure of the latent trait across all the RCTs appraised (Fig. 4). Test information corresponding to a marginal reliability >0.90 was present across the entire observed range of the promotion-to-treatment latent trait, indicating excellent reliability of the scale, that is, a highly precise measurement of theta. The slope of the TCC (Fig. 5) is steep, indicating that the VELUT should differentiate well between RCTs with varying levels of promotion-to-treatment trait, and the linear relationship with the computed score suggests that the VELUT scores predictably and consistently increase as the latent trait does. Flat areas of the TCC were confined to extreme score values (Fig. 5). This result supports the good interpretability of the VELUT overall scores and their correspondence with the promotion-to-treatment continuum.

Figure 5. Test characteristic curve (TCC).
After extensive in-person discussion in the workshop, the TDG agreed that the overall VELUT score may be complemented with the computation of the frequencies of response options that correspond to the IOM domains of promotion, selective prevention, indicated prevention and treatment to recognize that one RCT can provide evidence for each domain, irrespective of its positioning across the promotion-to-treatment continuum. Both computational combinations of the responses can be used to inform the classification of RCTs in evidence synthesis (see https://www.iph.usi.ch/it/strumenti/velut). The VELUT tool is presented in Table 2.
Table 2. The Verona-Lugano Tool – VELUT

NA, unclear or insufficient information.
a The study population screened for the presence of a mental conditions or indicators and proxies of psychological distress or other mental health condition using a mental health instrument/symptom checklist.
b Consider the operationalization of the inclusion and exclusion criteria (if any).
c See study sample characteristics, typically reported in Table 1.
Discussion
This paper presents the development and preliminary psychometric validation of the VELUT, a novel measurement tool designed to position RCTs along the IOM continuum, ranging from mental health promotion to treatment. Tool development followed rigorous, iterative qualitative and quantitative psychometric procedures and involved international experts from complementary disciplines, who participated as members of an interdisciplinary TDG.
Through multiple rounds of item generation, refinement meetings, qualitative feedback, cognitive interviews and statistical analyses, we identified a final set of eight items demonstrating strong psychometric properties. Our findings indicate that the VELUT exhibits excellent internal consistency, high item discrimination and robust unidimensionality, supporting its construct validity in assessing the promotion-, prevention- or treatment focus of MHPSS interventions in global and in public mental health.
Results from IRT analyses revealed that the category thresholds for all eight items were widely spaced, indicating good item distribution and strong discrimination across varying levels of the latent trait representing the spectrum from promotion to treatment. All 8 items were found to be non-redundant and informative, ensuring comprehensive coverage of the underlying construct. The development of the VELUT addresses a critical gap in global mental health research by providing a structured, evidence-based approach to characterize intervention studies along a continuum that is frequently conceptually and methodologically ambiguous.
Many of our study RCTs were classified as indicated prevention, a finding that aligns with prevailing trends in the scientific literature, which focus on population groups already presenting symptoms of mental disorders (Cuijpers, Reference Cuijpers2003, Reference Cuijpers2022; Thom et al., Reference Thom, Jonas, Reitzle, Mauz, Holling and Schulz2024).
In general, conducting prevention research in mental health is not easy. It requires measuring the incidence of mental disorders to determine whether an intervention reduces the onset of disorders. This presents three distinct challenges. First, the accurate measurement of incidence in prevention trials requires delivering a formal diagnostic interview, a time-consuming process which necessitates the involvement of trained professionals compared to using self-assessed symptom scales (Dattani, Reference Dattani2023; Jensen-Doss and Hawley, Reference Jensen-Doss and Hawley2010). Prevention trials require substantial financial resources and access to trained professionals capable of administering diagnostic assessments. These resources can be difficult to mobilize, particularly in low- and middle-income settings where mental health services are limited, and mental health professionals are scarce. Second, mental health prevention research often requires a broad and diverse participant base and long-term follow-up periods (Legemaat et al., Reference Legemaat, Burger, Geurtsen, Brouwer, Spinhoven, Denys and Bockting2023). Mental health disorders often emerge gradually and can be influenced by several factors, including the structural and social determinants of mental health (Lund et al., Reference Lund, Brooke-Sumner, Baingana, Baron, Breuer, Chandra, Haushofer, Herrman, Jordans, Kieling, Medina-Mora, Morgan, Omigbodun, Tol, Patel and Saxena2018; Machado et al., Reference Machado, Alves and Patel2024; Rose-Clarke et al., Reference Rose-Clarke, Gurung, Brooke-Sumner, Burgess, Burns, Kakuma, Kusi-Mensah, Ladrido-Ignacio, Maulik, Roberts, Walker, Williams, Yaro, Thornicroft and Lund2020). Also, prevention research often targets at-risk populations, complicating recruitment efforts and introducing potential ethical considerations and constraints (World Health Organization, 2022). Third, boundaries between the different levels of prevention are often blurred in mental health research, may be subjectively disputed and interpreted. This is part of the conceptual challenge that guided the development of the VELUT.
The VELUT is a new tool in mental health. There are several critical appraisal tools of primary studies in the scientific literature (Medicine CEBM, 2023) conceptually different from the VELUT. For example, it shares several premises with the Cochrane Risk of Bias tool (RoB) (Higgins et al., Reference Higgins, Thomas, Chandler, Cumpston, Li, Page and Welch2023). Like the RoB, the VELUT is outcome-specific and transparent. However, while the Cochrane RoB is explicitly designed to evaluate the RoB and the internal validity of primary studies, focusing primarily on methodology, the VELUT provides a complementary classification, that is a structured approach for positioning RCTs along the IOM continuum from promotion to treatment, as well as for quantifying the extent to which each RCT emphasizes promotion/universal prevention, selective prevention, indicated prevention or treatment. This quantification, which no existing tools currently offer, is important, as 2 RCTs may have the same overall VELUT score yet exhibit markedly different profiles across each IOM domain. This mitigates potential misclassification bias and underscores the conceptual limitation of relying on aggregate scores alone. Consequently, no fixed threshold was imposed for categorizing RCTs based on total scores, and we recommend against the use of cutoffs that would be inherently arbitrary and may introduce bias.
Our study has limitations. First, our methodology is novel. However, to our knowledge, there are no established methodologies to develop a scale in evidence synthesis. Our methodology is grounded on existing CHNRI methodology, and considers two robust conceptual frameworks, namely PICO and IOM. Second, different from the WHO guidelines development groups, though the TDG was gender balanced, low- and middle-income countries were less represented than high-income countries. However, several members of the TDG have a long experience in humanitarian, resource-poor settings, which may have helped to account for some attention to socio-cultural factors in all the steps of the VELUT development process. We followed a balanced, multidisciplinary, highly participatory, democratic and replicable approach. Third, although we did confirm that the unidimensionality assumption of the IRT model was met, we cannot exclude deviations from local independence of items, although this seems unlikely. A strength of this study is that we used IRT models to inform item selection and to embed a preliminary validation conducted on a large set of relevant RCTs assessed by a trained assessment team. A further strength worth noting is that we also confirmed the known groups’ validity of the VELUT cherry-picking a few RCTs at each extreme of the spectrum, where VELUT scores were lowest and highest, respectively. This provides further empirical support for the construct validity of this new tool.
The TDG has developed a new pragmatic, free and accessible tool to address key challenges in evidence synthesis. Moreover, the VELUT aims to also promote better RCT design by introducing a public mental health perspective to existing methodological tools and resources. The VELUT helps to systematically – and in a pragmatic, user-friendly way – clarify which methodological characteristics align a primary study with a focus on promotion, prevention or treatment. This classification can be done with an overall VELUT score complemented by the computation of the frequencies of response options that correspond to the IOM domains of promotion, selective prevention, indicated prevention and treatment. This classification may guide key methodological decisions such as inclusion and exclusion criteria, outcome measures and intervention selection. The VELUT may also allow the classification of RCTs and, accordingly, guide subgroup and sensitivity analyses in systematic reviews and meta-analyses. Additionally, it will orient policymakers and funders when considering the criteria for RCTs to be funded or implemented in specific global and public health initiatives.
The VELUT has been specifically developed to address gaps in global mental health research and evidence synthesis. But its conceptual framework and structure lend to broader applicability across other areas of health research that may also share the challenge of misclassification of studies along the promotion to treatment continuum. For example, the VELUT can be applied to trials involving children and adolescents, as well as populations with severe mental conditions or comorbidities, since the items are designed to capture general trial features independent of age group or diagnostic complexity. This new tool has potential implications for the design of new clinical trials that could be more precisely focused on studying promotion, prevention or treatment. For example, inclusion and exclusion criteria, tools for outcome assessment and analyses and implementation settings could be defined using the VELUT. Implications also regard the clinical practice and the use of the evidence in the intervention choice; priority setting and strategic planning activities, and, last but not least, the development of calls for application and evaluation research applications for funding agencies and policy makers.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S2045796025100280.
Availability of data and materials
Data will be made available upon motivated request to the contact authors, and after discussion with the TDG.
Acknowledgements
Not applicable.
Author contributions
Conceptualization: M.P., C.B., E.A.; methodology: A.G., E.A., M.J.D.J., W.A.T., C.L., M.P., C.B.; planning and being responsible for data analysis: A.G., E.A., M.P., C.B.; writing, critically reviewing, editing: M.P., E.A., A.G., A.M.A., C.A., C.C., M.J.D.J., C.L., D.P., E.P., M. Sijbrandij., M. Silva, F.T., W.A.T., C.B. M.P. and C.B. are guarantors and responsible for this work, its conduction, and will have full access to the data. M.P. and E.A. are joint first authors.
Financial support
This research is partially funded by a grant for the internationalization of research of the University of Verona (Bando PIA-2023) and by the Swiss National Science Foundation (Global Mental Health workshop 226841). D.P. was funded by the European Union’s Horizon-MSCA-2021-PF01 research program under grant agreement N 101061648. The funder had no role in the design and conduct of the study; collection, management, analysis and interpretation of the data; preparation, review or approval of the manuscript; and decision to submit the manuscript for publication.
Competing interests
The author(s) declared that there were no conflicts of interest with respect to the authorship or the publication of this article.
Ethical standards
The protocol for this study was approved by the Ethics Committee of Università della Svizzera Italiana (CE_2024_04 of 25-01-2024).


