1. Introduction
Picture an impatient commuter pushing past someone to board the bus first. In response, another person in the queue openly rebukes the shover for their rudeness. As an observer, would you deem the intervenor’s response appropriate? How would you judge this person? In the depicted situation, an uninvolved third party invests their own resources (e.g., time and effort) to punish a perpetrator, despite not being directly affected. This type of behavior is referred to as third-party punishment (TPP; Eisenberg and Miller, 1987; Fehr and Fischbacher, 2004) and can benefit societies by increasing cooperation and conformity to social norms (e.g., Fehr and Gächter, 2002; Henrich et al., 2006; Spitzer et al., 2007). Numerous studies, predominantly using economic game paradigms, demonstrate that approximately 50% of participants are willing to pay a personal price to inflict a reciprocal cost on an unfair social partner (for an overview, see Fehr and Fischbacher, 2004; Nowak and Sigmund, 2005; van Dijk and De Dreu, 2021).
But punishing is an ambivalent endeavor, and it is perceived as such. On the one hand, it can serve as a positive signal: Punishing a perpetrator, like compensating a victim, indicates empathic concern and compassion for the victim (Klimecki et al., 2016; Leliveld et al., 2012; Pfattheicher et al., 2019). Studies revealed that participants trust and reward third-party punishers more than individuals who do nothing (Barclay, 2006; Jordan et al., 2016; Raihani and Bshary, 2015; Vaish et al., 2016) and are more inclined to choose them as new social partners (Batistoni et al., 2022; Jordan et al., 2016; Kurzban et al., 2007). On the other hand, although sometimes conceptualized as an altruistic behavior, TPP can be detrimental and stigmatizing for punishers. Punishment is associated with anger and aggression (Eriksson et al., 2016; Lee and Warneken, 2020; Rodrigues et al., 2018) and spiteful motives like revenge and retaliation (Raihani and Bshary, 2019; van Doorn et al., 2018). Punishers induce fear in bystanders, who may conclude that they themselves risk harsh treatment in future interactions (Balafoutas et al., 2016; Delton and Krasnow, 2017; Fehr and Gächter, 2002).
In sum, TPP can signal both cooperation and aggression and, accordingly, has ambivalent reputational consequences (Dhaliwal et al., 2021; Panchanathan and Boyd, 2004).
Research on TPP decisions and their evaluation typically relies on economic game paradigms, which document third-party responses in controlled, incentivized settings (for an overview, see Thielmann et al., 2021). This approach has yielded important insights into conditions under which punishment is viewed as an adequate intervention strategy. Characteristics of the witnessed norm violation can justify punishment, for instance, whether the perpetrator harmed the victim on purpose (Buckholtz et al., 2008) or how grave the transgression was (e.g., Civai et al., 2019). Third parties use information such as severity (i.e., coins taken by the perpetrator) to calibrate a fitting response, and punishment becomes more likely and acceptable as transgression severity increases, in both adults (Stallen et al., 2018) and children (Arini et al., 2021). However, the predominant use of decontextualized, anonymous interactions with minimal monetary penalties (‘for every cent you invest, the other loses three cents’) fails to capture the rich variety of punishment strategies available in everyday life and does not allow researchers to address how core features of punishment, such as its type, impact punishment evaluations (see, e.g., Molho et al., 2020; Raihani and Bshary, 2019). Punitive strategies encompass various behaviors, including verbal rebukes, social ostracism, or physical aggression, which may be (interpreted as) more adequate than financial sanctions in many contexts (Balliet et al., 2022; Molho et al., 2020).
Moving beyond abstract paradigms, whose limited external validity and contextual flexibility challenge the generalizability of findings (e.g., discussed in Batistoni et al., 2022; Raihani et al., 2012), we aim to investigate which punishment strategies enhance or harm the punisher’s reputation and under what conditions specific punitive actions are viewed as useful tools for enforcing norms in more naturalistic settings.
1.1. Types of punishment
Monetary, property-oriented sanctions, as used in economic game studies, are a key and common strategy for prosecuting wrongdoers in real-life societies via legal institutions (Carlsmith et al., 2002; Guala, 2012; Schoenmakers et al., 2014), especially for minor offenses (Statistisches Bundesamt, 2022). Less attention has been devoted to the broad range of informal punitive responses that people employ in daily life settings. Psychological strategies can be observed from an early age: Toddlers express verbal protest upon witnessing property damage (Vaish et al., 2011), and adults confront perpetrators directly through rebukes and harsh criticism (Masclet et al., 2003; Wiessner, 2005), as indicated in our initial example. More indirect social tactics encompass gossiping (Feinberg et al., 2014; Giardini et al., 2021) and social exclusion (Beersma and Van Kleef, 2011; Dimitroff et al., 2020; Dunbar, 2004; Guala, 2012), with the latter leading to higher contributions in public goods experiments (Cinyabuguma et al., 2005) and reduced selfish behavior in both human (Turnbull, 1961) and animal groups (Carter, 2014; Krama et al., 2012). Next, justice restoration can extend to corporal punishment. Physical violence against individuals or groups, for instance whipping or branding, was acceptable across European society until the late eighteenth century (King, 2006; Straus, 1991).
Even today, more than 28,000 people are under sentence of death worldwide (Death Penalty Information Center, 2023). In everyday life, corporal punishment persists as a prevalent strategy for addressing misbehavior, particularly in child rearing, with about 17% of adolescents globally reporting recent experiences of such disciplining at school or home (Elgar et al., 2018).
Which of these punishment types (property-oriented, corporal, or psychological) are considered appropriate, and when? Given its immediate and confrontational nature, corporal punishment as a direct violation of another’s body is often associated with aggressiveness and hostility (Franklin-Luther and Volk, 2022; Gershoff, 2002) and can trigger stronger moral outrage and anger in observers than non-corporal harm (Asp et al., 2019; Eriksson et al., 2016; Jackson et al., 2006). Accordingly, children perceived spanking as less fair compared to non-corporal punishments like withdrawing toys (Vittrup and Holden, 2010). In contrast, property-oriented punishment is less emotionally charged and may be considered less effective in inducing behavioral change (Larzelere and Kuhn, 2005; Nelissen and Mulder, 2013). Due to its formal and calculable character, though, financial punishment may communicate clear messages about norm violations and carry predictable external incentives against future misconduct through cost-benefit logic (Guala, 2012; Molho et al., 2022). A potent person-related alternative, avoiding both physical and property harm, is psychological punishment.
Psychological punishment was advocated as the preferred strategy in everyday life (Elbla, 2012; Molho et al., 2020), promoted punishers’ perceived competence (Chen and Xu, 2020), and fostered cooperation more effectively than material sanctions (Nelissen and Mulder, 2013; Wu et al., 2016) by signaling strong communal condemnation (Beersma and Van Kleef, 2011; Feinberg et al., 2014). So far, systematic comparisons of these three punishment strategies, and of relative preferences among them, are sparse, especially within a unified empirical framework (but see, e.g., Vittrup and Holden, 2010, and Eriksson et al., 2016, for comparisons between punishment types).
1.2. Proportionality of punishment
Another critical feature shaping punishment appropriateness may be the fit between punishment and prior transgression. A central perspective on what constitutes ‘adequate’ punishment is the concept of retribution (Carlsmith, 2006), according to which TPP should reflect the nature and severity of the harm inflicted so that the wrongdoer experiences consequences similar to their own misdeeds. With different punishment strategies available, the third party can choose response types that match or differ from the transgression. When transgression and punishment align in type, observers may consider this as restoring balance, contributing to a subjective sense of justice (Goodwin and Benforado, 2015; Hofmann et al., 2018), and fostering the belief that the punisher acted thoughtfully and in accordance with societal expectations (Aldrovandi et al., 2013; Arini et al., 2021). To date, it remains an open empirical question whether similarity between the type of transgression and the type of punishment enhances perceived adequacy and the legitimacy of the punitive act.
1.3. Severity of punishment
The concept of ‘eye for an eye’ extends to the severity of the transgression and subsequent punishment. Third parties often deploy punishment proportionately to the severity of the prior norm violation (Carlsmith et al., 2002; Fehr and Gächter, 2002; Heffner and FeldmanHall, 2019), enhancing judgments of adequacy, fairness, and morality among observers (Balafoutas et al., 2016; Carlsmith et al., 2002). However, when punishment severity clearly exceeded that of the transgressions, punishers were seen as impulsive, less trustworthy, or even vengeful (Brandt et al., 2003; Dhaliwal et al., 2021). Studies comparing strong versus weak punishment tend to show greater disapproval for harsher sanctions, whether corporal or property-oriented (Eriksson et al., 2016; Lee and Warneken, 2020; Liu et al., 2021; Solomon and Lee, 2025). In contrast, harsher sanctions could effectively serve deterrent functions by clearly signaling strong criticism and preventing future violations by the culprit or by others (Balliet et al., 2011; Delton and Krasnow, 2017). In light of these contradictory findings, the evaluation of punishment severity may critically depend on the situation and on the punishment type. Given the formalized setups typically used to investigate severity effects, it remains unknown when, how, and why severity matters.
1.4. The current study
We aimed to systematically investigate how type and severity of punishment and their relation to the norm violation affect evaluations of punishment and punisher in a formalized manner while capturing the complexity of real-life social interactions. Three situational TPP features were manipulated: type of transgression, type of punishment (property-oriented, corporal, or psychological; Experiment 1), and severity of punishment (weak or strong; Experiment 2). Using a vignette approach (as in, e.g., Buckholtz et al., 2008; Eriksson et al., 2017; Lieberman and Linke, 2007; Martin et al., 2019), we created and validated a diverse set of written hypothetical scenarios. Each vignette followed the classic TPP structure, entailing a transgression between a perpetrator and a victim and a subsequent third-party intervention. All punishments in our studies were acts of peer-to-peer punishment carried out by ordinary third parties who had no formal authority or institutional role over the perpetrator, reflecting decentralized, informal enforcement of social norms as observed in everyday interpersonal contexts (e.g., Fehr and Fischbacher, 2004).
Both experiments measured the effects of the manipulations on perceived punishment adequacy (Behnke et al., 2020; Hopkins et al., 2022) and evaluations of the punisher’s warmth and competence, two fundamental dimensions people typically rely on when judging others (stereotype content model; Abele et al., 2021; Fiske et al., 2002). While warmth includes attributes like kindness, empathy, and trustworthiness, competence pertains to capability and efficiency. In terms of social approach, perceived warmth can foster inclinations to form friendships (Cuddy et al., 2008), and perceived competence can contribute to viewing someone as a suitable leader (Cuddy et al., 2011; Fiske et al., 2007). Correspondingly, we assessed participants’ hypothetical willingness to interact with the punisher as a friend or team leader, extending findings on reputational benefits of altruistic punishment (e.g., Barclay, 2006; Jordan et al., 2016; Santos et al., 2011).
2. Experiment 1
Experiment 1 manipulated the type of transgression toward the victim and the type of punishment subsequently administered by the third party. Both acts could take property-oriented, corporal, or psychological form (examples below). After each vignette, participants rated the punishment, the punisher, and their willingness to interact with the punisher. We expected more favorable ratings of punishment and punisher following psychological and corporal transgressions compared to property-oriented transgressions, as the former are more direct and likely to inflict more emotional or physical pain, thus warranting punishment (H1; Asp et al., 2019; Dimitroff et al., 2020; Jackson et al., 2006; Smetana and Ball, 2019). Regarding punishment types, the literature supports diverging hypotheses. For one, property-oriented punishment may elicit more positive evaluations than corporal and psychological strategies (H2a; Asp et al., 2019; Vittrup and Holden, 2010), given its common and formal use and its perception as less aggressive since it targets objects rather than the perpetrator directly (Eriksson et al., 2016). Alternatively, property-oriented punishment could receive more negative ratings than corporal and psychological punishment (H2b), as observers may view social sanctions as more effective in promoting cooperation, ensuring a safe environment, and enhancing the group’s net benefit (Nelissen and Mulder, 2013; Wu et al., 2016).
Finally, we expected more favorable evaluations across all five rating dimensions when the punishment type (e.g., property-oriented) matches the preceding transgression type (e.g., also property-oriented; H3; Aldrovandi et al., 2013; Carlsmith, 2006; Hofmann et al., 2018).
2.1. Methods
We report how the sample size was determined, all manipulations, collected measures, and data exclusions (Simmons et al., 2011). The preregistration for the experiments (https://osf.io/s7b3d/overview) and the stimulus material (https://osf.io/fr9bt/overview) are available on the Open Science Framework.
2.1.1. Development of study material
We created 24 scenarios, each depicting interactions among a perpetrator, a victim, and a punisher across distinct social contexts (learning, work, everyday life, and social relationships; see Figure 1). Each context included six different TPP scenarios, for example, involving roles like teachers, consultants, or actors in work-related settings. We manipulated each of the 24 scenarios according to all experimental conditions (Experiment 1: all combinations of type of transgression and type of punishment; Experiment 2: all combinations of severity of punishment and type of transgression/punishment), resulting in nine (Experiment 1) or six (Experiment 2) vignettes per scenario (see Table S1 in the Supplementary Material for an exemplary vignette set; see also Footnote 1). To illustrate the design with one scenario, consider an everyday life vignette involving handball players. Transgression type was manipulated as follows: property-oriented: taking a teammate’s headphones; corporal: pushing a teammate aside; and psychological: mocking a teammate’s cheap clothing. Punishment type was manipulated as follows: property-oriented: tearing off the perpetrator’s keychain; corporal: seizing the perpetrator by the arm; and psychological: giving the cold shoulder to the perpetrator during training. When punishment severity was manipulated in Experiment 2, the punisher in the corporal punishment example either pinched the shover’s arm (weak punishment, original punishment downgraded) or pulled the shover back forcefully, causing them to fall (strong punishment, original punishment intensified).

Figure 1 Overview of experiments, setup, and procedure.
Note: In the upper right part of the figure, the two experiments and their independent variables (IVs) are displayed. Each trial involved participants reading a vignette depicting a transgression and subsequent punishment, always representing one specific condition. Participants’ task was to picture the scenario and then rate the punishment, the punisher, and their interaction tendency on rating scales (displayed below the questions). These ratings comprised the five dependent variables (DVs).
Each of the three norm violation categories (property-oriented, corporal, and psychological) included various subtypes of offenses, guided by previous vignette-based research (e.g., Buckholtz et al., 2008; Martin et al., 2019) and everyday-life observations. As such, our setup allowed us to explore whether the overall effects of the type of transgression/punishment on participants’ evaluations were, for instance, driven by specific behaviors. Different harm subtypes could trigger distinct emotional and moral responses, subsequently shaping participants’ judgments of punishment appropriateness.
All vignettes were developed and tested in a pilot experiment to validate their suitability as experimental stimuli. For that purpose, we deconstructed the vignettes into individual, decontextualized components displaying the transgressions and punishments separately (e.g., ‘a person pinches another person’s arm’). Thirty-nine participants (M = 22.9 years, 28 female and 11 male) rated the statements on a seven-point Likert scale ranging from not at all morally reprehensible, wrong, or bad to extremely morally reprehensible, wrong, or bad. This approach allowed us to assess whether the individual components of our scenarios carried comparable moral weight, independent of their narrative context. Based on the results, we adjusted the vignettes to ensure comparable perceived wrongness across transgression and punishment types and severities (see Text S1 in the Supplementary Material). Further, we created male and female versions while maintaining gender consistency within vignettes and featured various age groups (children, adolescents, young adults, adults, and seniors) with identical ages within vignettes. Additionally, after expert focus-group discussions, we excluded punishments related to extensive planning or illegal activities and those that could be viewed as serving the punisher’s self-interest. Our refined scenarios aimed to strike a balance between external validity and experimental control to mitigate potential confounds, providing a solid foundation for our investigations and ensuring that observed differences in rating scores in the later experiments could be attributed to variations in the independent variables (IVs).
2.1.2. Participants
As outlined in the preregistration (https://osf.io/s7b3d/overview), required sample sizes were calculated a priori (using G*Power; Faul et al., 2007) with α = .05, power 1 − β = .8, and an estimated medium effect size f = .25 (as a conventional benchmark in the absence of robust empirical precedents) for both main effects and the interaction, resulting in a total of N = 34 per experiment. To account for possible dropouts and achieve equal cell sizes for all vignettes (requiring a multiple of 12, see Task and procedure section), we added 14 participants for a final sample of N = 48 per experiment. The present study adheres to the ethical standards of the 1964 Declaration of Helsinki regarding participant treatment in research. Participants gave informed consent prior to starting the respective experiment.
The 48 participants of Experiment 1 were students who were recruited via the University of Würzburg’s recruitment platform and compensated with course credit, or volunteers who participated without financial compensation. All participants met the preregistered inclusion criteria, which required fluency in German and passing at least 75% (three out of four) of attention checks designed to ensure concentration and engagement in the task. Mean age was 23.4 years (SD = 5.9, range 19–57) with 11 participants identifying as male, 36 as female, and 1 as diverse.
2.1.3. Task and procedure
The experiment, programmed in PsychoPy (Peirce et al., 2019), was accessed online via a link shared either through the recruitment platform or directly with participants and required a computer or laptop with a keyboard and mouse/trackpad. After completing demographic questions concerning age and gender, participants were instructed to read short vignettes describing hypothetical interactions between three parties. They were advised to carefully read each vignette and picture the situation as an outside observer. The instructions emphasized that all transgressions and punishments were intentional and that punishments always targeted the perpetrator.
Two practice trials preceded the experimental trials. Each trial started with a written vignette, followed by participants responding to five questions on seven-point Likert scales (see Figure 1). Questions for vignettes describing interactions between children required slight adjustments. The five questions were: (1) How adequate/appropriate do you perceive the punishment? (2) How trustworthy/upright/benevolent do you perceive the punisher? (3) How dominant/competent do you perceive the punisher? (4) Would you like the punisher to be your friend? [child option: If you were a kid yourself, how much would you like to befriend the punisher?] (5) Would you like the punisher to be your team leader? [child option: If you were a kid yourself, how much would you like to be in a group led by the punisher?]. All scales ranged from not at all to very, except the first question (punishment adequacy), which ranged from too weak to too strong with adequate in the center. To ensure data quality, four attention checks were intermixed with the vignettes, requiring participants to press a specified pole of the scale (e.g., Please press [not at all / very]).
2.1.4. Design and data analysis
Each experimental condition (according to the two factors type of transgression and type of punishment, each with the levels property-oriented, corporal, and psychological; nine conditions in total) was presented four times. The 36 experimental trials covered all four situational contexts (learning, work-related, everyday life, and social relational). To avoid memory effects, participants were exposed to a maximum of two vignettes from each of the 24 scenarios. Half of the trials involved male, and the other half female characters. To counterbalance gender and conditions, participants were evenly assigned to each of the 12 versions of possible vignette combinations.
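The structure of this design can be made concrete with a short enumeration. The following Python sketch is illustrative only and is not the authors' stimulus code: the constant and function names are hypothetical, trial order is not randomized, and the round-robin rule is merely one assumed way to obtain equal numbers of participants per version. The counts (3 × 3 conditions, four repetitions, 12 versions, 48 participants) are taken from the text.

```python
from collections import Counter
from itertools import product

# Levels shared by both factors (type of transgression, type of punishment).
TYPES = ("property-oriented", "corporal", "psychological")
REPETITIONS = 4       # each condition was presented four times
N_VERSIONS = 12       # counterbalancing versions of vignette combinations
N_PARTICIPANTS = 48   # final sample size per experiment


def build_conditions():
    """Return all (transgression, punishment) pairs: 3 x 3 = 9 conditions."""
    return list(product(TYPES, TYPES))


def build_trial_list():
    """Repeat every condition REPETITIONS times (no randomization here)."""
    return [cond for cond in build_conditions() for _ in range(REPETITIONS)]


def matching_conditions():
    """Conditions in which punishment type equals transgression type (H3)."""
    return [cond for cond in build_conditions() if cond[0] == cond[1]]


def participants_per_version(n_participants, n_versions):
    """Hypothetical round-robin assignment; requires a multiple of n_versions."""
    if n_participants % n_versions != 0:
        raise ValueError("sample size must be a multiple of the version count")
    return Counter(i % n_versions for i in range(n_participants))


if __name__ == "__main__":
    print(len(build_conditions()))      # number of conditions
    print(len(build_trial_list()))      # number of experimental trials
    print(participants_per_version(N_PARTICIPANTS, N_VERSIONS))
```

Under these assumptions, the enumeration yields nine conditions, 36 experimental trials, and four participants per version, which is why the sample size had to be a multiple of 12.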
Experiment 1 followed a 3 × 3 within-subjects design with the factors type of transgression (property-oriented, corporal, or psychological) and type of punishment (property-oriented, corporal, or psychological). Separate repeated measures analyses of variance (rmANOVA) were conducted for each of the five ratings. Greenhouse–Geisser corrections were applied if Mauchly’s test showed a violation of sphericity. In case of significant main effects or interactions, we computed paired t-tests to locate the origin of the effects, with Bonferroni-Holm adjustments for multiple comparisons. Additionally, we calculated correlations for all five average ratings of the dependent variables (DVs). We report ηp² and Cohen’s d as effect sizes.
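For readers unfamiliar with the Bonferroni-Holm procedure, the following Python sketch shows the standard step-down adjustment of a family of p-values. It is a generic illustration, not the authors' analysis code; the function name and the example p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm (step-down Bonferroni) adjusted p-values in input order.

    The smallest p-value is multiplied by m, the next by m - 1, and so on.
    A running maximum keeps the adjusted values monotone, and each value
    is capped at 1.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical uncorrected p-values from three pairwise comparisons.
    raw = [0.01, 0.04, 0.03]
    print([round(p, 3) for p in holm_adjust(raw)])  # [0.03, 0.06, 0.06]
```

In this made-up example, the comparison with raw p = .04 would pass an uncorrected α = .05 threshold but not the Holm-adjusted one.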
Finally, mediation analyses clarified whether punishment type affected the participants’ tendency to interact with the punisher, mediated by the evaluation of the punisher. Perceived warmth was tested as a mediator for the willingness to befriend the punisher, and perceived competence as a mediator for the willingness to join a team led by the punisher (according to the stereotype content model; Abele et al., 2021; Fiske et al., 2002). Mediation models were calculated using a bootstrapping-based approach with 1000 resampling simulations (Preacher and Hayes, 2008).
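The logic of a bootstrapped indirect effect can be sketched as follows. This is a deliberately simplified illustration and not the authors' model: it assumes a single binary predictor X (e.g., a contrast between two punishment types), one mediator M, one outcome Y, independent observations (ignoring the within-subjects structure of the actual data), and a percentile confidence interval. All function names and the simulated data are hypothetical.

```python
import random
import statistics


def indirect_effect(x, m, y):
    """Indirect effect a*b from the OLS models M ~ X and Y ~ X + M."""
    mx, mm, my = statistics.fmean(x), statistics.fmean(m), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    smm = sum((mi - mm) ** 2 for mi in m)
    sxm = sum((xi - mx) * (mi - mm) for xi, mi in zip(x, m))
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    smy = sum((mi - mm) * (yi - my) for mi, yi in zip(m, y))
    a = sxm / sxx                                          # path X -> M
    b = (sxx * smy - sxm * sxy) / (sxx * smm - sxm ** 2)   # path M -> Y given X
    return a * b


def bootstrap_indirect(x, m, y, n_boot=1000, seed=0):
    """Percentile 95% CI of the indirect effect from n_boot resamples."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            estimates.append(indirect_effect(
                [x[i] for i in idx], [m[i] for i in idx], [y[i] for i in idx]))
        except ZeroDivisionError:
            continue  # skip degenerate resamples (e.g., no variance in X)
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]


def simulate(n=48, seed=1):
    """Synthetic data with a true indirect effect of 1.0 * 0.7 (assumed values)."""
    rng = random.Random(seed)
    x = [i % 2 for i in range(n)]
    m = [1.0 * xi + rng.gauss(0, 0.5) for xi in x]
    y = [0.7 * mi + rng.gauss(0, 0.3) for mi in m]
    return x, m, y


if __name__ == "__main__":
    x, m, y = simulate()
    print(indirect_effect(x, m, y), bootstrap_indirect(x, m, y))
```

An indirect effect is typically considered statistically reliable when the bootstrap interval excludes zero. A full analysis of repeated-measures data would additionally need to account for the dependence between observations from the same participant.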
2.2. Results
The data and analysis code for all experiments are available on the Open Science Framework (https://osf.io/fr9bt/overview). For Experiment 1, rating means per condition and for all DVs are displayed in Figure 2 and reported in Table 1. Detailed pairwise comparisons can be found in Table 2.

Figure 2 Rating means per condition and DVs and correlations between DVs (Experiment 1).
Note: (A) Perceived adequacy of the punishment. A rating of four indicates adequate punishment, with lower values indicating too weak punishment and higher values indicating too strong punishment. (B) Perceived warmth of the punisher. (C) Perceived competence of the punisher. (D) The correlation matrix displays Pearson correlations for all five ratings averaged across conditions. (E) The hypothetical willingness to befriend the punisher. (F) The hypothetical willingness to be part of a team led by the punisher. Error bars indicate standard errors. ns p ≥ .05, * p < .05, ** p < .01, *** p < .001. Detailed violin plots displaying participant-level dispersion are provided in Supplementary Figure S1.
Table 1 Rating means (M) and standard deviations (SD) for types of punishment by types of transgression (Experiment 1)

Table 2 Pairwise comparisons of ratings between types of punishment, separately for types of transgression (Experiment 1)

Note: Prop = property-oriented, corp = corporal, psych = psychological. Cohen’s ds (d) are displayed as effect sizes. * p < .05, ** p < .01, *** p < .001.
2.2.1. Adequacy of punishment
Contrary to H1, type of transgression did not systematically affect the perceived adequacy of the ensuing punishment (F(2,94) = 2.77, p = .068, ηp² = .06). However, we observed a main effect of punishment type (F(2,94) = 14.35, p < .001, ηp² = .23); that is, both property-oriented and corporal punishment were perceived as less adequate (i.e., too strong) than psychological punishment (t(47) ≥ 4.13, p ≤ .001, d ≥ 0.60; averaged across transgression types, see Table S2 in the Supplementary Material for details). The difference between property-oriented and corporal punishment was not significant (t < 1). Punishment and transgression type did not significantly interact (F < 1).
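As a reading aid, the reported effect sizes can be related to the test statistics with two standard conversions. Partial eta squared follows from an F value and its degrees of freedom as ηp² = (F · df_effect) / (F · df_effect + df_error). For paired t-tests, one common variant of Cohen's d is d_z = t / √n. The sketch below is illustrative: the function names are hypothetical, and the use of d_z is an assumption on our part, since the text does not state which variant of d was computed. Under that assumption, the values reported in this subsection are reproduced to two decimals.

```python
import math


def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def cohens_dz(t_value, n):
    """Cohen's d_z for a paired t-test with n participants (assumed variant)."""
    return t_value / math.sqrt(n)


if __name__ == "__main__":
    # Main effect of punishment type on adequacy: F(2,94) = 14.35
    print(round(partial_eta_squared(14.35, 2, 94), 2))  # 0.23
    # Main effect of transgression type on adequacy: F(2,94) = 2.77
    print(round(partial_eta_squared(2.77, 2, 94), 2))   # 0.06
    # Smallest reported pairwise t for adequacy: t(47) = 4.13, n = 48
    print(round(cohens_dz(4.13, 48), 2))                # 0.6
```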
2.2.2. Warmth of the punisher
Evaluations of the punisher’s warmth did not vary between transgression types (F(2,94) = 3.07, p = .051, ηp² = .06). By contrast, perceived warmth depended on the administered punishment type (F(2,94) = 27.46, p < .001, ηp² = .37). In line with H2b, property-oriented punishers obtained lower ratings than corporal and psychological punishers (t(47) ≥ 4.10, p < .001, d ≥ 0.59; see Table S2 in the Supplementary Material). Furthermore, ratings for corporal punishers were lower than those for psychological punishers (t(47) = 3.12, p = .003, d = 0.45). As reflected in the significant interaction (F(4,188) = 7.48, p < .001, ηp² = .14), this pattern differed depending on the type of preceding transgression: Property-oriented punishers were perceived as less warm than corporal and psychological punishers, except when reacting to property-oriented transgressions (statistical values (t, p, d) are depicted in Table 2). Similarly, corporal punishers were perceived as less warm than psychological punishers, except when reacting to corporal transgressions. Psychological punishers received the highest evaluations when punishing psychological transgressions. Hence, supporting H3, warmth evaluations improved when the punitive act matched the transgression.
2.2.3. Competence of the punisher
Comparable to warmth ratings, we found no significant main effect of transgression type (F < 1), but a significant main effect of punishment type on the perceived competence of the punisher (F(2,94) = 15.32, p < .001, ηp² = .25). Mean ratings were lower for property-oriented punishers than for corporal and psychological punishers (t(47) ≥ 2.51, p ≤ .016, d ≥ 0.36; see Table S2 in the Supplementary Material). Corporal punishers were also rated as less competent than psychological punishers (t(47) = 2.90, p = .012, d = 0.42). Again, type of transgression interacted with type of punishment (F(4,188) = 11.49, p < .001, ηp² = .20, ε = .85; GG-corrected). Participants evaluated property-oriented punishers as least competent, except when addressing property-oriented transgressions (for statistical indices, see Table 2), again in line with H3. Similarly, they judged corporal punishers as significantly less competent than psychological punishers, except when responding to corporal transgressions. Evaluations of psychological punishers were highest after psychological transgressions.
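The Greenhouse-Geisser (GG) correction leaves F unchanged and multiplies both degrees of freedom by ε before the p value is looked up. The text reports the uncorrected df together with ε, so the sketch below only illustrates how ε = .85 would scale them. The helper name is hypothetical.

```python
def gg_corrected_df(df_effect: float, df_error: float, epsilon: float) -> tuple:
    """Scale both degrees of freedom by the Greenhouse-Geisser epsilon."""
    return df_effect * epsilon, df_error * epsilon

# Interaction on competence ratings (Experiment 1): F(4,188), epsilon = .85
df1, df2 = gg_corrected_df(4, 188, 0.85)
print(round(df1, 1), round(df2, 1))  # 3.4 159.8
```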
2.2.4. Tendency to accept the punisher as a friend
Once more, type of transgression did not affect the punisher evaluation (F(2,94) = 1.05, p = .353, ηp² = .02), but type of punishment did (F(2,94) = 22.71, p < .001, ηp² = .33). We observed less willingness to befriend property-oriented punishers than corporal and psychological punishers (t(47) ≥ 2.90, p ≤ .006, d ≥ 0.42), as well as corporal compared to psychological punishers (t(47) = 3.75, p < .001, d = 0.54; see Table S2 in the Supplementary Material). These preferences again depended on transgression type, indicated by a significant interaction (F(4,188) = 8.10, p < .001, ηp² = .15). Confirming H3, property-oriented punishers were liked less as friends than those using corporal and psychological punishments, except when punishing property-oriented transgressions (see Table 2). Corporal punishers were liked less than psychological punishers, unless punishments followed corporal transgressions. The highest willingness to befriend a punisher was found when psychological punishers responded to psychological transgressions.
2.2.5. Tendency to accept the punisher as a team leader
In contrast to the other ratings, we found a small main effect of transgression type (F(2,94) = 4.26, p < .05, ηp² = .08). Punishers intervening after corporal compared to psychological transgressions were less preferred as team leaders (t(47) = 2.66, p = .033, d = 0.38). Type of punishment impacted the willingness to be part of a team led by the punisher (F(2,94) = 26.78, p < .001, ηp² = .36), with mean ratings being lower for property-oriented punishers than for corporal and psychological punishers (t(47) ≥ 3.12, p ≤ .003, d ≥ 0.45), and for corporal than psychological punishers (t(47) = 4.05, p < .001, d = 0.58; see Table S2 in the Supplementary Material). The significant interaction between punishment and transgression type (F(4,188) = 8.84, p < .001, ηp² = .16) revealed that participants liked property-oriented punishers less in the role of team leaders than corporal punishers addressing corporal transgressions and psychological punishers addressing either corporal or psychological transgressions (see Table 2). This aversion for property-oriented punishers was absent after property-oriented transgressions. Similarly, corporal punishers were less preferred as team leaders than psychological punishers, unless they reacted to corporal transgressions. Psychological punishers responding to psychological transgressions elicited the highest preference as team leaders.
2.2.6. Mediation analyses
When investigating how warmth and competence ratings related to interaction tendencies, we observed strong correlations between assigned warmth and the willingness to befriend the punisher (r(1726) = .76, p < .001) and between assigned competence and the willingness to accept the punisher as a team leader (r(1726) = .70, p < .001), supporting the idea that perceived warmth fosters the inclination to befriend someone, and perceived competence the willingness to accept someone as a leader (e.g., Cuddy et al., 2008, 2011; Fiske et al., 2002, 2007). It is noteworthy, however, that warmth also correlated with leader preference, and competence with friendship choice (see Figure 2D). Mediation analyses tested how punishment type (using property-oriented vs. psychological as the two most diverging types) influenced friendship or leadership choices through perceived warmth or competence. Results showed that perceived warmth rendered the direct effect of punishment type on the inclination to befriend the punisher non-significant, suggesting full mediation (see panel A of Figure 3). Perceived competence partially mediated the effect of punishment type on the willingness to be led by the punisher (see panel B of Figure 3).

Figure 3 Results of mediation analyses for the effect of type of punishment on interaction tendencies mediated by perceptions of the punisher (Experiment 1).
Note: prop = property-oriented, psych = psychological. c’ indicates the direct effect of type of punishment on the interaction tendency, c indicates the total effect c’ + a * b. (A) The perceived warmth of the punisher was revealed as a full mediator for the willingness to befriend the punisher. (B) The perceived competence of the punisher was revealed as a partial mediator for the willingness to accept the punisher as a team leader.
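The decomposition c = c’ + a * b in the note can be illustrated with ordinary least squares on synthetic data. This is a sketch under stated assumptions: the data are simulated, all variable names are hypothetical, and the repeated-measures structure of the real ratings is ignored. The estimation method actually used for Figure 3 is not described in this section.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(u, v):
    """Sample covariance of two equally long sequences."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

random.seed(1)
# Synthetic data only: x codes punishment type (0 = prop, 1 = psych).
x = [i % 2 for i in range(200)]
# Simulated mediator (e.g., warmth) and outcome (e.g., friendship rating).
m = [3.0 + 1.2 * xi + random.gauss(0, 1) for xi in x]
y = [1.0 + 0.3 * xi + 0.8 * mi + random.gauss(0, 1) for xi, mi in zip(x, m)]

sxx, smm, sxm = cov(x, x), cov(m, m), cov(x, m)
sxy, smy = cov(x, y), cov(m, y)

a = sxm / sxx                              # path a: X -> M
c_total = sxy / sxx                        # total effect c from Y ~ X
det = sxx * smm - sxm ** 2                 # determinant of the 2x2 normal equations
c_direct = (sxy * smm - smy * sxm) / det   # direct effect c' from Y ~ X + M
b = (smy * sxx - sxy * sxm) / det          # path b: M -> Y, controlling for X

# With OLS estimates, the total effect equals direct plus indirect effect.
print(abs(c_total - (c_direct + a * b)) < 1e-9)  # True
```

Full mediation corresponds to a direct effect c’ that is no longer significant once the mediator is included, and partial mediation to a c’ that is reduced but still significant. The sketch does not include significance tests.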
2.3. Discussion
A consistent pattern was observed across all DVs. While type of transgression played a minor role (Hypothesis 1), type of punishment clearly shaped perceived punishment adequacy, warmth, and competence attributed to the punisher, and the willingness to interact with the punisher as a friend or team member (evaluations for property-oriented < corporal < psychological). Although we initially grouped corporal and psychological punishments as direct, person-related punitive measures, our findings reveal that observers evaluated them differently. Overall, results support Hypothesis 2b and align with the literature highlighting the acceptance and effectiveness of socially oriented punishments (Chen and Xu, 2020; Heffner and FeldmanHall, 2019; Kupfer and Tybur, 2023; Vittrup and Holden, 2010).
In line with Hypothesis 3, less favored punishments were evaluated more favorably after congruent transgressions in all DVs except the rating of punishment adequacy. Observers seem to possess an intuitive sense of the relative seriousness of different transgression types and support punishment interventions that fit these infractions (Sznycer and Patrick, 2020), judging them as especially competent, fair, and justified (Arini et al., 2021; Carlsmith et al., 2002; Hofmann et al., 2018).
Taken together, Experiment 1 results show that participants prioritized the punisher’s action over the perpetrator’s action in their evaluations, indicating that punishment characteristics fundamentally shape bystanders’ opinions. Moreover, participants considered punishment contextually in relation to the transgression rather than in isolation. To explore these dynamics further, Experiment 2 shifted the focus from matching punishment and transgression types to manipulating another central aspect of the punitive act, namely its severity (e.g., Batistoni et al., 2022; Liu et al., 2021; Solomon and Lee, 2025; Zhang and Qi, 2024). Critically, we also investigated how severity interacts with punishment type to uncover more complex patterns in observers’ judgments (see, e.g., Peterson, 2024).
3. Experiment 2
Experiment 2 investigated how punishment type and severity affect the evaluations of punishment and punisher. We included property-oriented, corporal, and psychological punishments that, unlike in Experiment 1, always aligned with transgression types. As a novel factor, we introduced punishments varying in severity (weak or strong), with both levels deviating equally from a medium-severity transgression. As before, participants rated the punishment (adequacy), the punisher (warmth and competence), and their willingness to interact with the punisher (as friend and team leader). We hypothesized that stronger punishment, potentially being perceived as irrational and impulsive (Dhaliwal et al., 2021; Lee and Warneken, 2020; Zhang and Qi, 2024), would elicit more negative ratings than weaker punishment (H4). Building on findings of Experiment 1, we expected property-oriented punishment to receive more negative evaluations than corporal or especially psychological punishment (H2b). Finally, we investigated whether the effect of severity on ratings depends on the type of punishment. Given that severe corporal harm can be particularly detrimental and fear-inducing (Ripoll-Núñez and Rohner, 2006; Smetana and Ball, 2019), for instance in child upbringing contexts (Brown et al., 2018; Larzelere and Kuhn, 2005), we predicted that negative appraisal of severe punishment would be most pronounced for corporal punishment (H5).
3.1. Methods
3.1.1. Participants
We collected data from 50 participants, comprising students enrolled at the University of Würzburg and volunteers. Students received compensation in the form of course credit, and volunteers did not receive any compensation. Two participants were excluded for failing the preregistered attention checks (see Participants section of Experiment 1). The final sample (N = 48) consisted of 19 male and 28 female participants and 1 diverse-gendered participant, with a mean age of 25.5 years (SD = 6.4, range 20–60; two participants did not provide age information).
3.1.2. Task and procedure
See Experiment 1.
3.1.3. Design and data analysis
The design closely resembled that of Experiment 1, with changes to the within-subjects factors and modifications to the number of experimental trials. Type of transgression remained congruent with type of punishment and was no longer treated as a within-subjects factor. Instead, severity of punishment was introduced as a new within-subjects factor (see Figure 1). Each experimental condition (according to the two factors severity of punishment with the levels weak and strong; and type of punishment with the levels property-oriented, corporal, and psychological; six conditions in total) was presented four times. The 24 experimental trials covered all four situational contexts (learning, work-related, everyday life, and social relational), with participants encountering one vignette from each of the 24 scenarios. Counterbalancing was equivalent to Experiment 1.
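The trial structure described above can be enumerated explicitly. This is a minimal sketch of the 2 × 3 design with four repetitions per cell. The variable names are hypothetical, and the assignment of vignettes and situational contexts to trials is not modeled.

```python
from itertools import product

severities = ["weak", "strong"]
punishment_types = ["property-oriented", "corporal", "psychological"]
repetitions = 4  # each condition was presented four times

# All combinations of the two within-subjects factors.
conditions = list(product(severities, punishment_types))

# One entry per experimental trial: (severity, punishment type, repetition).
trials = [
    (severity, punishment_type, repetition)
    for severity, punishment_type in conditions
    for repetition in range(1, repetitions + 1)
]

print(len(conditions), len(trials))  # 6 24
```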
The data were analyzed similarly to Experiment 1, employing rmANOVA for the five rating questions. Experiment 2 followed a 2 × 3 within-subjects design with the factors severity of punishment (weak vs. strong) and type of punishment (property-oriented, corporal, or psychological; aligned to the preceding transgression). Mediation models were the same as those in Experiment 1.
3.2. Results
Rating means per condition and for all DVs are displayed in Figure 4 and reported in Table 3.

Figure 4 Rating means per condition and DV and correlations between DVs (Experiment 2).
Note: (A) Perceived adequacy of the punishment. A rating of four indicates an adequate punishment, with lower values indicating that the punishment was too weak and higher values that it was too strong. (B) Perceived warmth of the punisher. (C) Perceived competence of the punisher. (D) The correlation matrix displays Pearson correlations for all five ratings averaged across conditions. (E) The hypothetical willingness to befriend the punisher. (F) The hypothetical willingness to be part of a team led by the punisher. Error bars indicate standard errors. ns p ≥ .05, * p < .05, ** p < .01, *** p < .001. Detailed violin plots displaying participant-level dispersion are provided in Supplementary Figure S2.
Table 3 Rating means (M) and standard deviations (SD) for types of punishment by severities of punishment (Experiment 2)

3.2.1. Adequacy of punishment
In line with H4, we observed a main effect of punishment severity (F(1,47) = 138.10, p < .001, ηp² = .75) such that strong punishment was rated as less adequate (i.e., too strong) than weak punishment. The main effect of punishment type was also significant (F(2,94) = 12.62, p < .001, ηp² = .21), indicating that property-oriented and corporal punishment, which were perceived similarly (t < 1), were both considered less adequate (i.e., too strong) than psychological punishment (t(47) ≥ 4.02, p < .001, d ≥ 0.58; see Table S4 in the Supplementary Material). As reflected in the significant severity × punishment type interaction (F(2,94) = 5.86, p = .004, ηp² = .11), this pattern differed depending on punishment severity: Property-oriented punishments were perceived as less adequate than psychological acts only when severity was strong (t(47) = 5.07, p < .001, d = 0.73) but not when it was weak (t(47) = 1.44, p = .268, d = 0.21).
3.2.2. Warmth of the punisher
Evaluations of the punisher’s warmth differed depending on punishment severity (F(1,47) = 103.51, p < .001, ηp² = .69), demonstrating that those who punished strongly were perceived as less warm than those who punished weakly. Type of punishment also had a significant impact (F(2,94) = 31.95, p < .001, ηp² = .41, ε = .88; GG-corrected). Specifically, property-oriented and corporal punishers were rated lower than psychological punishers (t(47) ≥ 5.25, p < .001, d ≥ 0.76), while the difference between property-oriented and corporal punishers was not significant (t(47) = 1.90, p = .064, d = 0.27; see Table S4 in the Supplementary Material for details). Findings support H2b, especially regarding the contrast between property-oriented and psychological punishment. Contrary to expectations, we found no significant severity × punishment type interaction (F < 1).
3.2.3. Competence of the punisher
As for adequacy and warmth, we found a main effect of punishment severity on the perceived competence of the punisher (F(1,47) = 24.17, p < .001, ηp² = .34). Strong punishers were viewed as less competent than weak punishers. Additionally, type of punishment impacted competence evaluations (F(2,94) = 21.40, p < .001, ηp² = .31, ε = .88; GG-corrected). Pairwise comparisons (see Table S4 in the Supplementary Material) yielded significantly lower competence ratings for property-oriented than for corporal and psychological punishers (t(47) ≥ 2.09, p ≤ .042, d ≥ 0.30), and for corporal than for psychological punishers (t(47) = 4.39, p < .001, d = 0.63), further underscoring H2b. As for perceived warmth, the severity × punishment type interaction was not significant (F(2,94) = 1.27, p = .285, ηp² = .03).
3.2.4. Tendency to accept the punisher as a friend
Once more, severity of the punishment affected punisher evaluations (F(1,47) = 145.42, p < .001, ηp² = .76), as strong punishers were less preferred as friends than weak punishers. Participants took different punishing types into account when evaluating their willingness to befriend the punisher (F(2,94) = 34.85, p < .001, ηp² = .43). Property-oriented and corporal punishers were less likely to be sought as friends than psychological punishers (t(47) ≥ 6.53, p < .001, d ≥ 0.94; see Table S4 in the Supplementary Material). The difference between property-oriented and corporal punishers was not significant (t(47) = 1.80, p = .079, d = 0.26). This pattern was similar for both severity levels, as indicated by the absence of interaction (F(2,94) = 1.37, p = .259, ηp² = .03).
3.2.5. Tendency to accept the punisher as a team leader
The tendency to accept the punisher as a team leader varied with punishment severity (F(1,47) = 102.97, p < .001, ηp² = .69), being lower for strong punishers than for weak punishers. Additionally, we found a main effect of type of punishment (F(2,94) = 33.02, p < .001, ηp² = .41), with participants being less willing to be led by property-oriented or corporal than by psychological punishers (t(47) ≥ 5.98, p < .001, d ≥ 0.86; see Table S4 in the Supplementary Material). The difference between property-oriented and corporal punishers was not significant (t(47) = 1.66, p = .104, d = 0.24). As before and opposed to what we initially expected, we found no severity × punishment type interaction (F < 1).
3.2.6. Mediation analyses
Analogous to Experiment 1, we conducted two mediation analyses (Figure 5). When controlling for perceived warmth, participants still preferred psychological over property-oriented punishers as friends, albeit to a smaller degree, suggesting partial mediation (see panel A of Figure 5). Similarly, participants were more willing to be led by psychological than by property-oriented punishers, even when controlling for perceived competence (see panel B of Figure 5). In general, strong effects of warmth on friendship decisions and competence on leadership preferences emphasize the importance of these dimensions in guiding interaction inclinations, further underlined by robust correlations between warmth and friendship (r(1150) = .81, p < .001) and between competence and leadership (r(1150) = .67, p < .001; see Figure 4D). Note, however, that ascriptions of warmth also correlated with choosing punishers as leaders, and ascriptions of competence with choosing them as friends.
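The degrees of freedom of the reported correlations suggest that they were computed across all single-trial ratings rather than across participant means. This is an inference from the numbers and is not stated in the text. Under that assumption, df = N × trials − 2, as the short check below shows (the function name is hypothetical).

```python
def correlation_df(n_participants: int, trials_per_participant: int) -> int:
    """Degrees of freedom of a Pearson correlation over all single-trial ratings."""
    return n_participants * trials_per_participant - 2

# Experiment 2: 48 participants, 24 experimental trials each.
print(correlation_df(48, 24))  # 1150
```

Correlations pooled in this way treat ratings from the same participant as independent observations, which is worth keeping in mind when reading the p values.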

Figure 5 Results of mediation analyses for the effect of type of punishment on interaction tendencies mediated by perceptions of the punisher (Experiment 2).
Note: prop = property-oriented, psych = psychological. c’ denotes the direct effect of type of punishment on the interaction tendency, and c denotes the total effect c’ + a * b. (A) The perceived warmth of the punisher was revealed as a partial mediator for the willingness to befriend the punisher. (B) The perceived competence of the punisher was revealed as a partial mediator for the willingness to accept the punisher as a team leader.
3.3. Discussion
Third parties exerting strong compared to weak punishment were evaluated more negatively across all rating dimensions, aligning with Hypothesis 4 (e.g., Eriksson et al., 2016; Lee and Warneken, 2020; Liu et al., 2021). Our findings are in line with type-specific studies examining punishment severity. For instance, mild corporal punishment was seen as regrettable but tolerable, while severe corporal punishment was viewed as abusive and morally incompetent on the punisher’s part (Brown et al., 2018). Similarly, research contrasting weak vs. strong property-oriented punishment (taking a couple versus many items away) found that older children disapproved of harsh punishment and preferred moderate, lenient interventions (Solomon and Lee, 2025). For psychological punishment, strong responses (e.g., suspension instead of a warning for minor infractions) were perceived as excessive and unwarranted (Peterson, 2024), and harsh instead of mild verbal reprimands eroded trust in the punisher, particularly by diminishing perceived benevolence and integrity (Zhang and Qi, 2024). None of these studies had included different punishment types, leaving open any potential interactions of type and severity of punitive acts.
Notably, and contrary to Hypothesis 5, our study demonstrated that the difference between weak and strong punishment evaluations remained consistent across all three punishment types in almost all dependent measures. An exploratory visual examination, however, suggested that harsh corporal punishment garnered the numerically strongest disapproval in learning and workplace settings (see Text S3 and Table S5 in the Supplementary Material). Though only a preliminary numerical trend, this pattern may reflect legal and ethical codes that especially take effect in professional environments (Peterson, 2024), where concerns about power abuse or abusive supervision are particularly prominent (Tepper, 2007).
Finally, we replicated Experiment 1 findings regarding type judgments, with psychological punishment preferred over property-oriented punishment (Hypothesis 2b). Corporal punishment was rated between these two types in weak punishment scenarios but received similarly low evaluations as property-oriented punishment in strong punishment scenarios, where the salience of the sanctions’ harshness may have overshadowed any further differentiation between types (except for the consistently positive perception of psychological sanctions).
4. General discussion
Punishment is a tricky business: While it stabilizes cooperation in groups, it can tarnish the punisher’s reputation or, in worst-case scenarios, trigger feuds. To better understand punishment appraisal, this study manipulated key characteristics of TPP and investigated their effect on the evaluation of punishment adequacy, the punisher’s traits in terms of warmth and competence, and the observer’s hypothetical willingness to engage with the punisher in the future.
All punishment characteristics implemented in our study shaped observers’ evaluations. Participants preferred psychological over corporal and property-oriented sanctions (Hypothesis 2b; Experiments 1 and 2; punishment type) and weaker over stronger punishments (Hypothesis 4; Experiment 2; punishment severity). Proportionality mattered for the ratings concerning the punisher’s warmth and competence, and the willingness to interact with the punisher in the future, with punishments aligning to the type of preceding transgression rated more positively (Hypothesis 3; Experiment 1; transgression type). Interestingly, punitive interventions were not perceived negatively per se. Beyond the experimental manipulations, they were generally viewed as appropriate, with punishers judged moderately warm and competent, and participants expressing some inclination to interact with them.
Across experiments, we tested two competing hypotheses regarding punishment type. First, property-oriented sanctions may receive better evaluations due to their established use in the legal system, their focus on targeting a person’s belongings rather than their body or psyche (Guala, 2012; Schoenmakers et al., 2014), and their perception as rational and predictable, offering clear incentives against future misconduct (Molho et al., 2022). The counterhypothesis conjectured that less institutionalized person-oriented corporal or psychological punishments would be judged superior in everyday interactions (Balliet et al., 2022; Molho et al., 2020; Nelissen and Mulder, 2013). Our findings consistently demonstrated a preference for psychological punishment, like verbal disapproval (even if formulated harshly) or suggestions for future comportment, which can promote trust in the third party’s behavior, demonstrate personal investment and consideration, and produce long-lasting effects (Kupfer and Tybur, 2023; Philippsen et al., 2023). Critically, while previous studies showed the efficacy of psychological sanctions when combined with financial penalties (Chen and Xu, 2020; Nelissen and Mulder, 2013), our findings suggest that they are equally valued as stand-alone punishment.
In addition, observers may view psychological punishment as most sustainable for group well-being, because it avoids both the resource depletion caused by financial punishment (Dreber et al., 2008; Wu et al., 2016) and the threats to the physical integrity of group members posed by corporal punishment (Eriksson et al., 2016). The less favorable ratings of property-oriented sanctions may partly reflect expectations that especially these penalties should be imposed by legal agencies (Eriksson et al., 2013; Martin et al., 2019; Raihani and Bshary, 2019) rather than by unauthorized, equal-status peers as in our scenarios (Gordon et al., 2014; Guala, 2012; Schoenmakers et al., 2014). Our flexible vignette paradigm offers an ideal platform to test these dynamics further. Because punishment (as well as reward) is a backbone of sustaining large-scale cooperation, understanding how both public authority figures and lay third parties can gain a better reputation and legitimacy across different punishment types is crucial (Tyler et al., 2015). Importantly, our within-subjects design (in contrast to, e.g., Eriksson et al., 2016; Martin et al., 2019) enabled us to carve out differences between psychological, corporal, and property-oriented punishment while accounting for participants’ baseline attitudes. This design choice, along with the large sample size, supports the credibility of our results.
Research shows that participants infer punishers’ personality traits based on the perceived appropriateness of their actions. Those who administer fair and proportionate sanctions are ascribed higher trustworthiness, warmth, and competence (e.g., Barclay, 2006; Jordan et al., 2016; Raihani and Bshary, 2015), and less irrationality (Lee and Warneken, 2020; Liu et al., 2021). Our experiments extend these findings, revealing a preference for punishment types that match the preceding transgression type, with observers judging these interventions as more appropriate and rating the respective punishers as especially warm and competent (Hofmann et al., 2018; Sznycer and Patrick, 2020). When punishments diverged from the transgression’s severity, milder sanctions were favored over harsher ones (Hypothesis 4). Notably, our study is among the first to investigate the impact of punishment severity across three distinct punishment types, indicating that both type and severity independently shape judgments. Severe punishments consistently led to greater disapproval, regardless of their specific form (contradicting Hypothesis 5).
Research on the signaling value of TPP suggests that intervening third parties can gain social benefits, like access to new social partners (e.g., Barclay, 2006; Jordan et al., 2016; Santos et al., 2011). Our findings contribute to this research by demonstrating (i) how core characteristics of punishment scenarios shape participants’ propensity for future interactions with punishers and (ii) how punisher evaluations drive these interactions. Participants showed the highest interest in interacting with punishers when punishments were psychological (particularly following psychological transgressions) and mild. Additionally, mediation analyses revealed that inclinations for future interactions were largely driven by punisher evaluations. Specifically, perceived warmth fully (Experiment 1) or partially (Experiment 2) mediated the effect of punishment type (property-oriented vs. psychological) on hypothetical friendship, suggesting that social sanctions conveyed warmth and benevolence, which in turn increased affiliation (Cuddy et al., 2008). Perceived competence partially mediated the effect on hypothetical leadership in both experiments. The fact that punishment type retained a direct impact on leadership acceptance beyond perceived competence implies that additional criteria matter: potentially detrimental decisions may carry greater weight in leadership than friendship contexts (Dong et al., 2022). Altogether, while punishment type still exhibited direct effects, punishers’ inferred personal qualities clearly shaped social interaction tendencies (in line with the stereotype content model, Cuddy et al., 2011; Fiske et al., 2007).
It needs to be noted that correlational analyses suggested a high overlap between the dimensions warmth, competence, potential friendship, and potential leadership. This likely reflects both methodological decisions and conceptual overlap. Methodologically, the similar response formats for the DVs and the close succession of the ratings may have encouraged participants to provide relatively consistent judgments. Furthermore, research indicates that warmth and competence are often co-attributed in moral evaluations, particularly when little information is available, as in our vignettes (Abele and Wojciszke, 2007; Fiske et al., 2007).
Building on vignette-based approaches (e.g., Gordon et al., 2014; Lee and Warneken, 2020), we employed an exceptionally wide range of scenarios with diverse punishment strategies. Realistic vignettes about familiar transgressions and punishments likely evoke stronger emotional processing (Martin et al., 2019) and are more relatable than abstract economic game settings involving monetary punishments in the range of a few cents (Guala, 2012). Evaluations were remarkably consistent: Preferences for psychological sanctions remained stable across learning, work, everyday life, and social relationship contexts involving all age groups. Within punishment types, we utilized a comprehensive range of prevalent strategies, such as reprimand, gossip, blame, public shaming, or social exclusion for psychological punishments (see, e.g., Balliet et al., 2022). Post-hoc inspections of individual vignette evaluations provided first insights into which specific punitive strategies were (dis)favored (see Text S2 and Text S3 in the Supplementary Material for details and additional analyses). Although our analysis of subtypes was exploratory, future studies can adjust the repertoire of vignettes to systematically investigate differences within punishment types, e.g., by statistically comparing the evaluation of non-confrontational indirect vs. confrontational direct approaches.
4.1. Limitations
Participants assumed an observer’s perspective in scenarios delineated in written vignettes. While hypothetical responses cannot fully reflect real-life reactions to actual events (e.g., Carlsmith, Reference Carlsmith2006; Cui et al., Reference Cui, Wang, Cao and Jiao2019; FeldmanHall et al., Reference FeldmanHall, Dalgleish, Thompson, Evans, Schweizer and Mobbs2012)—for instance, third parties exhibit greater leniency toward harm in hypothetical situations due to the lack of personal involvement and consequences (Bostyn et al., Reference Bostyn, Sevenhant and Roets2018)—our goal was not to measure participants’ own punitive behavior but rather to examine how observers evaluate punishment strategies enacted by others. Further, to foster imagery and immersion in the situation, we employed vignettes enriched with detailed social information (see Evans et al., Reference Evans, Roberts, Keeley, Blossom, Amaro, Garcia, Stough, Canter, Robles and Reed2015). We nonetheless acknowledge the constraint that judgments of hypothetical situations may differ from evaluations of real observations.
Next, we relied on two non-representative samples mainly composed of German neurotypical, often female students with an interest in psychological research, who may differ from the general population in decision-making and moral motivation (Cappelen et al., Reference Cappelen, Nygaard, Sørensen and Tungodden2015). For instance, prior work suggests that women and men differ in their sensitivity to harm or preference for prosocial punishment (Kamas and Preston, Reference Kamas and Preston2021). Additionally, support for certain sanctions can vary between individualistic and collectivistic cultures; for example, while physical confrontation and ostracism were more accepted in collectivistic, high power-distance societies, more emancipative societies favored gossip and disapproved of punishment more strongly (Eriksson et al., Reference Eriksson, Strimling, Gelfand, Wu, Abernathy, Akotia, Aldashev, Andersson, Andrighetto, Anum, Arikan, Aycan, Bagherian, Barrera, Basnight-Brown, Batkeyev, Belaus, Berezina, Björnstjerna and Van Lange2021). However, given that core moral values and social norms appear globally widespread with only minor regional variation (Alfano et al., Reference Alfano, Cheong and Curry2024), the observed support for psychological, proportionate, and mild punishments is unlikely to stem solely from sample characteristics. Still, findings should be interpreted with caution, and future research needs to replicate the results in larger, representative samples.
Although we instructed participants that punishments were not intended to benefit the third party, participants may still have perceived the punisher as benefiting, which may have contributed to more negative ratings, especially for property-oriented interventions (e.g., Eriksson et al., Reference Eriksson, Andersson and Strimling2016; Krasnow et al., Reference Krasnow, Delton, Cosmides and Tooby2016; Redhead et al., Reference Redhead, Dhaliwal and Cheng2021). Future studies could explicitly manipulate benefits for the third party or include a control question asking whether participants interpreted the punishment as self-serving.
Additionally, our vignettes did not provide information on the effectiveness of the punishment or the reaction of the perpetrator to the punitive action. Given prior emphasis on the relevance of punishment outcomes (Funk et al., Reference Funk, McGeer and Gollwitzer2014), this aspect could be included and even manipulated in our vignettes in future studies. This way, we could probe whether and to what extent the observer’s judgment or satisfaction concerning the punitive action is enhanced, e.g., when the perpetrator demonstrates a change in attitude or behavior.
Finally, we did not include alternative action options for third parties, such as compensation (e.g., Batistoni et al., Reference Batistoni, Barclay and Raihani2022; Heffner and FeldmanHall, Reference Heffner and FeldmanHall2019; Li et al., Reference Li, Hu, Xu and Li2021) or non-acting (e.g., Dhaliwal et al., Reference Dhaliwal, Patil and Cushman2021; Martin et al., Reference Martin, Jordan, Rand and Cushman2019). Our results imply that, in the absence of other options, punishing defectors is accepted as a default response to unfairness (Gromet and Darley, Reference Gromet and Darley2009). Future research could include a non-acting third party, allowing comparisons with those who choose not to intervene.
5. Conclusion
Considering the contentious public perception of third-party interventions (Dhaliwal et al., Reference Dhaliwal, Patil and Cushman2021; Raihani and Bshary, Reference Raihani and Bshary2015), a better understanding of the complex situational factors shaping evaluations of TPP is essential. In this series of experiments, we employed vignettes depicting various hypothetical yet realistic interactions. Taken together, our findings show that psychological punishments, ranging from verbal reprimands to temporary exclusion from activities, were favored by external observers, whereas corporal and resource-based punitive measures lacked broad support.
Let us return to the bystander from the introduction who witnesses an impatient commuter aggressively skipping the queue. Our bystander should carefully calibrate their intervention: To be seen as competent and warm, they should choose a response similar in type to the transgression or opt for the widely accepted psychological approach. Ideally, they should tailor the severity of their reaction to the seriousness of the transgression or choose a slightly milder response. Remaining impartial, the bystander should focus on communicating the norm violation, offering the perpetrator an opportunity to reflect and adjust future behavior.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/jdm.2025.10021.
Disclosure of use of AI tools
During the preparation of this work, the authors used ChatGPT-5 (OpenAI, 2025) to improve the readability and language of the manuscript. The authors reviewed and edited the content as needed and take full responsibility for the content of the published article.
Data availability statement
The data presented in this study are openly available on the OSF at https://osf.io/fr9bt/overview.
Acknowledgments
We would like to thank our student research assistants, Maria Goldkin, Anne Kirsch, and Emil Stein, for their support in developing the study materials and assisting with data collection.
Funding statement
This research received no specific grant funding from any funding agency, commercial or not-for-profit sectors.
Competing interest
The authors declare no competing interests.





