John et al. offer insights about proxy failure across a range of disciplines (see their Table 1). We argue that this proxy failure lens can also be fruitfully applied in psychological science, where construct validity serves as a proxy for the goal of measuring unobservable psychological phenomena. Validated measurements (i.e., scores on self-report questionnaires and tests of ability) then serve as proxies for higher-order goals, such as improving clinical outcomes.
The Standards for Educational and Psychological Testing, co-developed by the American Psychological Association, state: “A sound validity argument integrates various strands of evidence into a coherent account of the degree to which existing evidence and theory support the intended interpretation of test scores for specific uses” (American Educational Research Association et al., 2014, p. 21). Further, because the validity of test scores can vary depending on the properties of the sample being examined, “[b]est practice is to estimate both reliability and validity, when possible, within the researcher's sample or samples” (Appelbaum et al., 2018, p. 9). Unfortunately, there is a growing body of evidence demonstrating that studies across psychological science routinely accept measurements as valid without sufficient validity evidence (Alexandrova & Haybron, 2016; Flake, Pek, & Hehman, 2017; Higgins, Ross, Polito, & Kaplan, 2023b; Hussey & Hughes, 2020; Shaw, Cloos, Luong, Elbaz, & Flake, 2020; Slaney, 2017), including measurements used for important clinical applications such as diagnosing and treating depression (Fried, Flake, & Robinaugh, 2022).
Building on the work of John et al., we propose that the proxy failure framework highlights a key cause of inadequate construct validity evidence in psychological science: When test scores become targets, focus is diverted away from the relationship between test scores and the underlying psychological constructs they are intended to measure. This, we suggest, can result in divergence between test scores and psychological phenomena. For instance, tests are sometimes used in different populations without considering whether the relationship between test scores and psychological constructs holds across populations.
Consider the Reading the Mind in the Eyes Test (Eyes Test; Baron-Cohen, Wheelwright, Hill, Raste, & Plumb, 2001), which is widely used as a measure of social cognitive ability in samples drawn from many clinical and nonclinical populations and countries (Pavlova & Sokolov, 2022). Despite the near-universal practice of calculating a single sum score for the 36-item Eyes Test, there are two key pieces of evidence that the structural properties of Eyes Test scores vary across samples and that the interpretation of sum scores is not always supported. First, factor analysis studies spanning multiple language versions of the Eyes Test have reported poor unidimensional model fit (e.g., Dordevic et al., 2017; Higgins, Ross, Langdon, & Polito, 2023a; Olderbak et al., 2015; Redondo & Herrero-Fernández, 2018; Topić & Kovačević, 2019), and it has even been found that the factor structure of Eyes Test scores for different ethnic and linguistic groups within the same country can vary (Van Staden & Callaghan, 2021). Second, a recent meta-analysis identified substantial variation in the internal consistency estimates of Eyes Test scores across samples, with half falling below the level conventionally taken to be acceptable (Kittel, Olderbak, & Wilhelm, 2022). Yet, Eyes Test sum scores are frequently compared between populations, with inferences drawn about relative levels of social cognitive ability.
An outstanding question when the proxy failure framework is applied to psychological science is why studies that fail to meet existing construct validity evidence reporting standards are routinely published (Flake & Fried, 2020; Slaney, 2017). As John et al. note, a proxy must be simple enough for agents and regulators to identify and understand (i.e., must be legible), so that it can “feasibly be observed, rewarded, and pursued” (target article, sect. 3.2, para. 2). However, legibility can come at a cost to fidelity: “There is a natural human tendency to try to simplify problems by focusing on the most easily measurable elements. But what is most easily measured is rarely what is most important” (Muller, 2018, p. 23). Although psychological research standards state that the construct validity proxy should be based on a variety of sources of validity evidence (American Educational Research Association et al., 2014; Clark & Watson, 2019), some sources of evidence are more legible than others. We contend that the failure to enforce best practices in construct validation can be explained in part by the prioritisation of legibility over fidelity. This results in less legible sources of validity evidence being overlooked, a phenomenon we refer to as “proxy pruning.” Unfortunately, proxy pruning can result in test scores being accepted as valid that might be deemed invalid if other sources of validity evidence were examined.
A key example of proxy pruning in psychological science is the neglect of construct validity evidence derived from psychological theory (Alexandrova & Haybron, 2016; Bringmann, Elmer, & Eronen, 2022; Eronen & Bringmann, 2021; Feest, 2020). In particular, researchers often over-rely on the psychometric properties of test scores when establishing construct validity and avoid challenging theoretical questions about how to define psychological constructs (Alexandrova & Haybron, 2016; Clark & Watson, 2019). In addition to being important in their own right, these theoretical questions can be critical to interpreting psychometric properties. Consider the use of convergent validity evidence in the well-being literature, where “it can seem as if nearly everything correlates substantially with nearly everything else” (Alexandrova & Haybron, 2016, p. 1104). Some researchers have deemed that “better” measures of well-being are those that correlate more strongly with measures of life circumstances, including income, governance, and freedom. However, without having done the hard conceptual work of determining what precisely is meant by “well-being” (e.g., “happiness,” “life satisfaction,” “flourishing,” “preference satisfaction,” “quality of life”; Alexandrova & Singh, 2022), it remains unclear why correlating more strongly with these particular variables is indicative of a more valid measure of well-being.
In sum, John et al.'s proxy failure framework offers insights into poor measurement practices in psychological science. However, we argue that the problem with proxies is not only that they become targets, but that they are pruned down to be legible targets, which decreases their fidelity, leaving them more susceptible to proxy failure.
Financial support
This work was supported by an Australian Government Research Training Program (RTP) Scholarship (W. C. H.), a Macquarie University Research Excellence Scholarship (W. C. H.), and the John Templeton Foundation (R. M. R., grant number 62631; A. J. G., grant number 61924).
Competing interest
None.