
Resolving the Reference Class Problem at Scale

Published online by Cambridge University Press:  14 April 2025

Aaron Roth*
Affiliation:
University of Pennsylvania
Alexander Williams Tolbert*
Affiliation:
Emory University
Corresponding authors: Aaron Roth; Email: aaroth@gmail.com and Alexander Williams Tolbert; Email: alexander.tolbert@emory.edu

Abstract

We draw a distinction between the traditional reference class problem, which describes an obstruction to estimating a single individual probability—which we rename the individual reference class problem—and what we call the reference class problem at scale, which can result when using tools from statistics and machine learning to systematically make predictions about many individual probabilities simultaneously. We argue that scale actually helps to mitigate the reference class problem, and purely statistical tools can be used to efficiently minimize the reference class problem at scale, even though they cannot be used to solve the individual reference class problem.

Type: Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of the Philosophy of Science Association

1 Introduction

Statistical inference is fundamental to data-driven decision making. However, this process has foundational challenges, including the reference class problem (Reichenbach 1971; see, e.g., Hájek [2007] for a modern treatment). The reference class problem refers to the general issue of determining the appropriate reference class over which to estimate probabilities. It comes about from the fact that finite amounts of data do not provide us access to “individual probabilities” (see Dawid 2017)—we can only estimate aggregates over sufficiently large groupings of the data, or reference classes—and the decision of which reference class to aggregate over can dramatically change what our estimates are. Yet there is not, in general, any way to uniquely choose a reference class using only finite amounts of data.Footnote 1

We make a new distinction between different manifestations of the reference class problem. We call the type of reference class problem that traditionally has been studied the individual reference class problem. This is the reference class problem as it arises in the context of making a single probabilistic forecast about an individual—for example, in the context of life insurance, whether a particular individual will die within the next 12 months, or in the context of criminal justice, whether a particular inmate will go on to commit another crime if released on parole.

As these examples might already bring to mind, however, forecasts of this sort are rarely made exactly once; rather, they are often systematically and repeatedly made across many individuals. Within the domains of insurance, criminal justice, medicine, technology, and so forth, increasingly sophisticated statistical and machine learning methods are used to make probabilistic predictions about individuals at scale. A natural concern is that this will correspondingly lead to the reference class problem at scale—that the arbitrariness and indeterminacy that the individual reference class problem injects into individual prediction will grow with the scale with which we are now making inferences and lead to indeterminacy among models, which will result in widespread arbitrary decision making. This concern has been highlighted in the computer science and “algorithmic fairness” literature under the name of the predictive multiplicity or model multiplicity problem (see, e.g., Marx et al. 2020; Black et al. 2022).

Our thesis is that although the individual reference class problem poses significant challenges in justifying predictions about individuals, scale in fact serves to mitigate rather than propagate the problem. When making predictions about many people, there is a way to resolve the reference class problem at scale: the area of disagreement between two models can itself be used to constructively falsify and improve at least one of the models. The consequence of this is that we can never find ourselves in a situation in which we have two competing “equally well-justified” models that make significantly different predictions about a significant fraction of the population. This is in contrast to the individual prediction problem, in which our inability to uniquely pick a reference class can put us in the position of having two “equally well-justified” but mutually incompatible estimates of an individual probability. Thus, the practical consequences of the reference class problem diminish at scale, although it remains a problem for individual predictions.

More generally, one can worry about a statistical evidence paradox that might arise when two different models, $f$ and $f'$, each making probabilistic predictions about some predicate, are equally consistent with the data across all considered statistical tests, including, for example, measures of accuracy and bias across different data divisions, and yet the two models produce very different predictions. If this paradox arises, it highlights the limitations of traditional statistical evidence because the evidence is consistent with widely varying predictions. Here we highlight multicalibration (Hébert-Johnson et al. 2018; see also its generalization to “outcome indistinguishability” [Dwork et al. 2021]), a recent framework emerging from the computer science literature on algorithmic fairness, as a way to think about and simultaneously enforce the consistency of a model with statistical evidence coming from arbitrary collections of reference classes. Multicalibration is intellectually closely related to Dawid’s notion of computable calibration (Dawid 1985) and the collectives-based formalization of randomness originating with Von Mises (1981) and Martin-Löf (1966). Through this connection, it also bears a resemblance to classical ways of thinking about the reference class problem, such as Salmon’s “objectively homogeneous reference classes” (Salmon 1977). However, it differs in that it focuses on those reference classes that can be identified with bounded amounts of computation and data. This feature makes it actionable in finite-data settings that arise in practice. Although multicalibration (even with respect to all computationally identifiable reference classes) generally does not imply uniqueness of predictions and so does not resolve statistical evidence paradoxes, any pair of models that witnesses a statistical evidence paradox witnesses a reference class on which one of the two models must be inconsistent, deriving from the region of disagreement of the two models. Cross-calibration, in the style of Roth et al. (2023), corresponds to enforcing multicalibration on reference classes that derive from these disagreement regions, which has the effect of eliminating them and resolving statistical evidence paradoxes.

2 The individual reference class problem

The reference class problem is a fundamental issue in statistical inference (Venn 1866; Reichenbach 1971; Hájek 2007). The problem arises when an event or individual can be classified in multiple ways, each leading to a different probability assessment. Hájek (2007) gives a variant of the following definition, which we modify here to provide formalism that will be useful to our subsequent discussion.

Definition 2.1 Fix a universe of elements $e \in {\cal X}$ (corresponding to, e.g., records pertaining to people) and a distribution ${\cal D}$ over these elements. Fix a binary predicate Footnote 2 $F:{\cal X} \to \{0,1\}$, as well as a collection of subsets of the universe of elements $S_1, \ldots, S_n \subseteq {\cal X}$ called reference classes. Suppose our access to the underlying probability distribution is limited to sampling from it, and so we can measure conditional probabilities $\Pr[F(e) = 1 \mid e \in S_i]$ for sufficiently large reference classes, but we cannot directly access “individual probabilities” $\Pr[F(e) = 1 \mid e]$ for fixed $e \in {\cal X}$ (and we might not even make intellectual commitments that posit that these are coherent Footnote 3 ). Then the individual reference class problem arises for an individual element $e \in {\cal X}$ if there are multiple incomparable reference classes $S_i \ne S_j$ (such that neither $S_i \subset S_j$ nor $S_j \subset S_i$) with $e \in S_i$ and $e \in S_j$, but $\Pr[F(e) = 1 \mid e \in S_i] \ne \Pr[F(e) = 1 \mid e \in S_j]$.

When the reference class problem arises, there is no unique way to assign $F(e)$ a probability by estimating a probability conditional on a reference class $S_i$.

The reference class problem arises when assigning an “individual probability” to an event when the specific event has only been observed to happen once—or perhaps can only happen once, even in principle. Examples of such individual probabilities include the probability that it will rain tomorrow and the probability that a particular individual will die within the next 12 months. Probabilities for such events cannot be unambiguously estimated from data; in practice, we assign the event to an appropriate grouping of “similar” instances in the data, estimate the prevalence of the “similar” events over the grouping, and then impute this estimated probability to the individual in question. For example, in a life insurance scenario, to estimate the probability that a particular individual “Alice” will die within the next 12 months, we might group individuals whom we have insured in the past who share demographic similarities—for example, are of similar age, gender, and weight and have similar medical conditions—and estimate the proportion of people in this grouping who have died within a 1-year period. We will then interpret this proportion as the “probability” that Alice will die within the next 12 months. Training statistical or machine learning models ultimately amounts to finding such groupings automatically and does not avoid the problem.Footnote 4 There is a trade-off in how we form this grouping or “reference class.” If we insist that the people in the reference class be identical to Alice in every way, we will find that it is empty. So, we must include people who deviate from Alice in a variety of ways in order to have enough samples of data to solve the statistical estimation problem of estimating the 1-year mortality rate within this reference class. But this gives us many degrees of freedom. There are different ways to group people who are “like” Alice in various different ways, and different groupings of people into different reference classes will result in different estimated mortality rates. We can try to justify various choices of reference classes with a mechanistic theory of the world, but ultimately, any such model must be validated on data, and we may not have unambiguous ways to choose between models (which implicitly assign individuals to reference classes) that are equally predictive.

The problem of the reference class arises when there are various overlapping categories that an individual may potentially belong to. To illustrate this concern, we examine two cases: the “Gatecrashers” case and the Cosmos Club case.

2.1 The Gatecrashers case

The Gatecrashers scenario was introduced by Cohen (1977) and involves a rodeo event in which some of the attendees failed to pay for admission—but for any particular attendee, we only have statistical evidence.

We quote here a variant due to Bolinger (2020). In this variant, among the 1000 people who attended, only 10 paid admission. We also know that 10 of the attendees are Boy Scouts, and among this group, 9 of them paid; additionally, 3 of the attendees are Canadians, and among this group, 2 of them paid. Now we are tasked with evaluating the probability that Alfred paid:

Alfred is a Canadian former boy scout in F [an attendee at the rodeo]. Let Ga be the proposition “Alfred failed to pay.” Conditional on being in F, P(Ga) = .99; conditional on Alfred’s being a former boy scout, P(Ga) = .1; conditional on Alfred’s being Canadian, P(Ga) = .33. With this as evidence, what credence should a rational agent have in Ga? Presumably in asking this question, we think that credences should be substantially constrained by one’s evidence; the problem is that the evidence is not univocal. It supports multiple, competing probability assignments, depending on which reference class we attend to. To identify what credence is rational, we’ll first need to determine whether being a member of F, or former boy scout, or Canadian, is most relevant to determining whether Alfred failed to pay. Bolinger (2020), 2420.

The difficulty here is that we do not have any data for the most specific reference classFootnote 5 that we could place Alfred into—Canadian former Boy Scouts (because Alfred is the only one). We can estimate averages over Canadians or over Boy Scouts, but those estimates are different, and on what basis are we to say that one reference class is more salient to the task at hand than another? This example demonstrates how the choice of reference class can lead to different probability assessments, which could significantly affect the conclusion about Alfred’s payment status. We note in passing that Bolinger seems to be making the assumption that credences should correspond to (frequentist) probabilities here. This is a view that we are sympathetic with but not one that is universally agreed upon.
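
To make the arithmetic behind these competing assignments explicit, here is a minimal sketch in Python using the counts quoted above (the dictionary keys and variable names are ours, purely for illustration):

```python
# Counts from Bolinger's variant quoted above: 1000 attendees of whom
# 10 paid, 10 Boy Scouts of whom 9 paid, and 3 Canadians of whom 2 paid.
reference_classes = {
    "attendee (F)": {"size": 1000, "paid": 10},
    "former Boy Scout": {"size": 10, "paid": 9},
    "Canadian": {"size": 3, "paid": 2},
}

# Alfred belongs to all three classes, yet each yields a different
# estimate of Pr[Ga], the probability that Alfred failed to pay.
for name, counts in reference_classes.items():
    pr_failed = (counts["size"] - counts["paid"]) / counts["size"]
    print(f"Pr[failed to pay | {name}] = {pr_failed:.2f}")
# Prints 0.99, 0.10, and 0.33: three competing, equally evidence-based
# answers, with no data at all for the intersection of the classes.
```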

2.2 The Cosmos Club case

The reference class problem is also a common source of stereotyping and racial bias. Racial stereotyping can arise when people make inferences by inappropriately using race to define a reference class. Historian John Hope Franklin recounts an incident at his Washington, D.C., social club, the Cosmos Club, that illustrates this:

It was during our stroll through the club that a white woman called me out, presented me with her coat check, and ordered me to bring her coat. I patiently told her that if she would present her coat to a uniformed attendant, “and all of the club attendants were in uniform,” perhaps she could get her coat. Franklin (2005), 4, 340

At the Cosmos Club, the majority of the attendants were black, and there were few black members. This demographic distribution likely led to the woman’s mistaken belief that Franklin, who was a black member of the club, was an attendant.

This case is an example of the reference class problem because the woman incorrectly used the race of the club’s attendants as a reference class to make a prediction about Franklin’s role at the club. She estimated the probability of a person being a staff member, given their race, to be high. If race were the only distinguishing feature, then this might have been a statistically reasonable inference. But in this case, there were other distinguishing features. A more appropriate reference class would have been whether a person is wearing a uniform or dinner attire. This would have led to a more accurate inference (Bolinger 2020; Gardiner 2018).

In these cases, the reference class problem comes about because the same event can be classified differently based on the available information. This leads to different probabilities. This issue has significant implications in legal cases, where the determination of guilt or liability often depends on the interpretation of probabilistic evidence (Rhee 2007). These are all instances of what we call the individual reference class problem, in that the object of interest is always a property or outcome of a single, distinguished individual.

In the next section, we will define the reference class problem at scale, in which the object of interest is a mapping from evidence to predictions that can be applied to many individuals. Such an object is the outcome, for example, of fitting a statistical model.

3 The reference class problem at scale

3.1 Defining the reference class problem at scale

The reference class problem at scale arises when we systematically make many predictions of individual probabilities. To explore this, we need to study prediction at scale. In an informal sense, prediction at scale is conducted using statistical or machine learning models, which are capable of automatically generating predictions about any individual case given descriptive features about that case. These technologies, therefore, implicitly commit to many predictions about individuals, each potentially subject to the individual reference class problem.

Consider a health-care scenario in which a hospital uses machine learning models to predict patient outcomes—say, the likelihood of hospital readmission—for thousands of patients. Such a model would take as input patient records and output for each patient a number between 0 and 1, purporting to be the “probability” that the patient will be readmitted to the hospital within (say) 12 weeks of discharge. Each prediction is of an “individual probability,” and so depending on the reference class chosen, multiple predictions might be justified as reasonable for each person. The model, however, chooses only one prediction per person. So, one might worry that this indeterminacy of multiple reasonable individual predictions would accumulate into an indeterminacy among multiple reasonable models, each consistent with the evidence but making very different predictions across a large number of people. Can we, for example, have two equally accurate models, neither of which is statistically falsifiedFootnote 6 by any hypothesis test that we can devise, that nevertheless make very different predictions about a large number of people?

The possibility of encountering the reference class problem at scale also poses an ethical consideration: If there are multiple very different models equally consistent with the data, on what basis can we justify choosing any one of them to make decisions with important implications about people? To continue our hospital readmission example, once the health-care system is in possession of the predictions of its model, it might use them to distribute scarce and valuable resources. For example, it might assign free home visitations by nursing staff to the patients with the highest predicted probability of readmission. But if two equally well-justified predictive models make very different predictions, then we would also have two methods for allocating scarce and valuable resources to a vulnerable population that result in very different outcomes for many individuals—on what basis can we justify choosing one of the methods over another? To summarize, the individual reference class problem makes predictions about individuals necessarily arbitrary in a certain sense (at least in the practical case in which we must learn from only finite data); is there a way to escape this arbitrariness when making predictions at scale?

Given a universe of elements ${\cal X}$, a model $f:{\cal X} \to [0,1]$ assigns values $f(e)$ to elements $e \in {\cal X}$ that purport to be individual probabilities for some predicate $F$; that is, the model purports that $f(e) = \Pr[F(e) \mid e]$ (“The probability that Alice dies within the next 12 months”). Of course, we should be skeptical of such claims and seek to validate or statistically falsify them based on data. Given samples from the distribution and a reference class $S \subset {\cal X}$ that has nontrivial mass under the distribution, we can estimate conditional expectations $\mathbb{E}[F(e) \mid e \in S]$. Therefore, we can falsify a purported model of individual probability if it significantly deviates from consistency with respect to a reference class. Informally, a model is $\epsilon$-consistent with respect to a reference class $S$ if, when we average the model’s predictions over $e$ in the reference class, we get the same value (up to error $\epsilon$) as we do when we average the actual observed outcomes over the reference class. True individual probabilities $f(e) = \mathbb{E}[F(e) \mid e]$ would satisfy this property. So, a failure to be consistent with respect to a reference class $S$ exhibits an inconsistency of the model with respect to the statistical evidence before us.

Definition 3.1 A model of individual probabilities $f:{\cal X} \to [0,1]$ is $\epsilon$-consistent with a reference class $S \subseteq {\cal X}$ on a predicate $F$ if

$$\mathbb{E}[f(e) \mid e \in S] \approx_\epsilon \mathbb{E}[F(e) \mid e \in S],$$

where $a \approx_\epsilon b$ if $|a - b| \le \epsilon$. A reasonable “degree of consistency” $\epsilon$ will depend on the quantity of data available to us and the frequency of the reference class, which in turn control the accuracy to which we can estimate distributional parameters like $\mathbb{E}[F(e) \mid e \in S]$.
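
As a concrete illustration of definition 3.1, the following sketch checks $\epsilon$-consistency from a finite sample; the array-based interface and the function name are our own assumptions, not part of the formal definition.

```python
import numpy as np

def is_eps_consistent(f_vals, F_vals, in_S, eps):
    """Check epsilon-consistency (definition 3.1) on a finite sample.

    f_vals: model predictions f(e) in [0, 1] for each sampled element
    F_vals: observed binary outcomes F(e) for the same elements
    in_S:   boolean mask marking membership in the reference class S
    eps:    the tolerance epsilon
    """
    if not in_S.any():                      # empty class: nothing to check
        return True
    avg_prediction = f_vals[in_S].mean()    # estimates E[f(e) | e in S]
    avg_outcome = F_vals[in_S].mean()       # estimates E[F(e) | e in S]
    return abs(avg_prediction - avg_outcome) <= eps
```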

Just as the individual reference class problem arises when we have an individual for whom there are multiple, seemingly equally good ways to assign probability forecasts as a function of different reference classes, the reference class problem at scale will arise when we have multiple, seemingly equally good models for making predictions at scale (i.e., models that are all consistent with the same set of reference classes) that frequently disagree with one another. Informally, we say that the reference class problem at scale arises with respect to a collection of reference classes if we have two models that are both consistent with all of the reference classes in the collection and yet frequently make predictions that substantially differ from one another. More precisely:

Definition 3.2 Fix a universe of elements $e \in {\cal X}$ (corresponding to, e.g., records pertaining to people) and a distribution over these elements. Fix a Boolean predicate Footnote 7 $F:{\cal X} \to \{0,1\}$, as well as a collection of subsets of the universe of elements $S_1, \ldots, S_n \subseteq {\cal X}$ called reference classes. The reference class problem at scale arises if we have two different models $f_1, f_2:{\cal X} \to [0,1]$ that are both $\epsilon$-consistent on all of the reference classes but that frequently make substantially different predictions:

$$\Pr\left[f_1(e) \not\approx_\epsilon f_2(e)\right] \ge \epsilon.$$

Here, $a \not\approx_\epsilon b$ if $|a - b| \gt \epsilon$. Once again, the parameter $\epsilon$, which we use to measure both consistency with the reference classes and disagreement between the models, should be small, and what a reasonable value is depends on the amount of data we have to estimate the statistical parameters appearing in the definition.
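
Reusing the `is_eps_consistent` helper (and the numpy import) from the sketch above, one could test on a sample whether two models witness the reference class problem at scale in the sense of definition 3.2; again, the interface is our own illustration.

```python
def witnesses_problem_at_scale(f1_vals, f2_vals, F_vals, class_masks, eps):
    """Definition 3.2 on a sample: both models are eps-consistent with
    every reference class in the collection, yet they disagree by more
    than eps on at least an eps fraction of elements."""
    both_consistent = all(
        is_eps_consistent(f_vals, F_vals, mask, eps)
        for f_vals in (f1_vals, f2_vals)
        for mask in class_masks
    )
    disagreement_rate = (np.abs(f1_vals - f2_vals) > eps).mean()
    return both_consistent and disagreement_rate >= eps
```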

The reference class problem at scale, which we also call a statistical evidence paradox, would be paradoxical because it suggests that two models, equally supported by the data, can nonetheless disagree substantially on their predictions. Although not in the language of reference classes, the general problem of having multiple models “equally supported by the data” that make very different predictions has been noted in the machine learning literature as the predictive multiplicity or model multiplicity problem (Marx et al. 2020; Black et al. 2022).

An initial question is whether we can reasonably expect to get far enough to encounter a statistical evidence paradox. After all, we have defined one as occurring if we have two models that are both consistent with all of the reference classes we have considered and yet have substantial disagreements about their predictions. Perhaps, just as with the individual reference class problem, it is difficult or impossible to find a model that is simultaneously consistent with many incomparable reference classes. Fortunately, finding a model that is consistent with many reference classes is possible—even from finite data. This is known as multicalibration, introduced by Hébert-Johnson et al. (2018) and with intellectual roots dating back to Martin-Löf (1966), Von Mises (1981), and Dawid (1985). Martin-Löf (1966) and Von Mises (1981) give a theory of randomness based on “collectives,” and Dawid (1985) gives a calibration-based foundation for empirical probability, all of which are based on the idea of consistency with respect to collections of selection rules that can subselect a data sequence based on its observable properties. Selection rules can be thought of as defining reference classes; indeed, Salmon (1977) provided philosophical foundations for proper reference classes, specifying that they should be “homogeneous”—that is, that it should not be possible to apply a further subselection within a reference class in such a way that the conditional probability of the outcome changes. Whereas Martin-Löf (1966), Salmon (1977), Von Mises (1981), and Dawid (1985) were all concerned with the set of all selection rules or reference classes (or countably infinite sets of reference classes, or all computable reference classes), which in the end turns out to give little guidance in settings in which we have only finite data at our disposal, multicalibration as defined by Hébert-Johnson et al. (2018) focuses on collections of reference classes over which we can perform statistical estimation using finite amounts of data.Footnote 8

Here, we give an equivalent variant of the definition of multicalibration using the language of reference classes. Note that the reference classes may be arbitrary and may even be defined in reference to the model $f$ (e.g., “the set of all people $e$ such that $f(e) = 0.2$”).

Definition 3.3 A model of individual probabilities $f:{\cal X} \to \left[ {0,1} \right]$ is $\epsilon $ -multicalibrated with respect to a collection of reference classes ${\cal S}$ if it is simultaneously $\epsilon $ -consistent with each reference class $S \in {\cal S}$ .

Hébert-Johnson et al. (2018) (see also Roth [2022] for a textbook exposition) show that for any set of reference classes ${\cal S}$, it is always possible to find a model $f$ that is $\epsilon$-multicalibrated, with both data and computational requirements that scale reasonably with (i.e., are low-degree polynomial functions of) the inverse error tolerance $1/\epsilon$, $\log |{\cal S}|$, and $\max_{S \in {\cal S}} 1/\Pr[e \in S]$, the inverse of the frequency of the least common reference class. In fact, the guarantee is stronger: given any model $f$, it is possible to modify the model (using a modest amount of data) to provide the guarantee that the modified model $f'$ is $\epsilon$-consistent with an arbitrary set of reference classes ${\cal S}$ while only improving the squared error of the model.
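
Continuing the same sketch, searching the collection ${\cal S}$ for a falsifying reference class (the basic step behind both definition 3.3 and the construction below) might look as follows; the function name and the list-of-masks interface are hypothetical, introduced only for illustration.

```python
def find_violated_class(f_vals, F_vals, class_masks, eps):
    """Return the index of a reference class (given as a list of boolean
    masks) on which the model fails to be eps-consistent, or None if the
    model is eps-multicalibrated (definition 3.3) on this sample."""
    for i, mask in enumerate(class_masks):
        if not is_eps_consistent(f_vals, F_vals, mask, eps):
            return i
    return None
```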

Definition 3.4 Fix a universe of elements $e \in {\cal X}$ (corresponding to, e.g., records pertaining to people) and a distribution ${\cal D}$ over these elements. Fix a model $f:{\cal X} \to \left[ {0,1} \right]$ and a binary predicate $F:{\cal X} \to \left\{ {0,1} \right\}$ . The squared error (or Brier score) of the model $f$ with respect to $F$ is

$$B(f) = \mathbb{E}\left[(f(e) - F(e))^2\right].$$
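
On a sample, the empirical counterpart of this quantity is just the mean squared difference between predictions and outcomes, as in this one-line sketch (same array conventions as above):

```python
def brier_score(f_vals, F_vals):
    """Empirical Brier (squared) error of definition 3.4."""
    return float(((f_vals - F_vals) ** 2).mean())
```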

As an intermediate lemma, Hébert-Johnson et al. (2018) show that given any model $f$, if we discover a reference class $S$ on which $f$ is not $\epsilon$-consistent, it is possible (from small amounts of data) to produce a new model $f'$ that has a lower Brier score (see also Roth [2022] for a formulation closer to what we state here):

Lemma 3.1 (Hébert-Johnson et al. [2018]). Given a model $f$ and a reference class $S$ such that

  1. $f$ is not $\epsilon$-consistent on $S$, and

  2. $S$ has a probability mass of at least $\mu_S$: $\Pr[e \in S] \geq \mu_S$,

then it is possible to efficiently (with an amount of data sampled independent and identically distributed [i.i.d.] from the underlying distribution scaling polynomially with $1/\mu_S$ and $1/\epsilon$) produce a model $f'$ such that Footnote 9

$$B(f') \le B(f) - \Theta(\epsilon^2 \mu_S).$$

Let us pause to consider the implications of this lemma, which are several. First, suppose there is a fixed collection ${\cal S}$ of reference classes that we wish our model to be $\epsilon$-consistent with respect to. One way to achieve our goal is to repeatedly check whether our current model fails to be $\epsilon$-consistent with respect to any $S \in {\cal S}$, and if so, update the model using the update from lemma 3.1. Because each time this occurs, the Brier score decreases, and because the Brier score cannot go below zero, this process is guaranteed to halt (after at most $O\left(\max_{S \in {\cal S}} \frac{1}{\epsilon^2 \mu_S}\right)$ many iterations) with a model that is $\epsilon$-consistent on every reference class in ${\cal S}$—this is essentially the algorithm given by Hébert-Johnson et al. (2018). Moreover, this process is only accuracy-improving—informally, because the “true individual probabilities” would be consistent with respect to every reference class and would also be the global minimizers of the Brier score (because the Brier score is a proper scoring rule), the updates in lemma 3.1 “march toward truth.”Footnote 10 Finally, the only way this procedure needs to interact with the data is by estimating conditional expectations over events (reference classes) that have nontrivially large probability, which is a purely statistical problem that can be solved with modest amounts of data. So, it is indeed possible to satisfy the preconditions of a statistical evidence paradox—to find models that are consistent with any collection ${\cal S}$ of reference classes that we may care to select. But lemma 3.1 can be applied iteratively even if we do not commit ahead of time to the reference classes that we will ask for consistency over, which is what allows us to avoid the reference class problem at scale. In the next section, we describe the “cross-calibration” approach taken by Roth et al. (2023). Informally speaking, “calibration” asks that a single model be consistent with the reference classes defined by its own predictions; cross-calibration takes as input two models and, for each, constructs reference classes defined jointly by the predictions of both models, then asks for multicalibration with respect to these reference classes. This procedure is iterated until a fixed point is reached, at which point each model is consistent with the reference classes defined with respect to both itself and its counterpart. An important part of the argument is that the fixed point is reached quickly.
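
To make the update and the resulting iteration concrete, here is a simplified, in-sample sketch built on the helpers above. It is not the exact algorithm of Hébert-Johnson et al. (2018): it ignores the estimation error and data-reuse issues that the formal analysis handles, and the patching rule (shift predictions on the class by its average residual, then clip to [0, 1]) is one simple way to realize the lemma 3.1 update.

```python
def patch_on_class(f_vals, F_vals, mask):
    """Lemma 3.1-style update: shift the predictions on the reference
    class by the model's average error there, then clip to [0, 1].
    On the sample, this can only lower the Brier score."""
    f_new = f_vals.copy()
    bias = F_vals[mask].mean() - f_vals[mask].mean()   # miscalibration on S
    f_new[mask] = np.clip(f_new[mask] + bias, 0.0, 1.0)
    return f_new

def multicalibrate(f_vals, F_vals, class_masks, eps, max_rounds=10_000):
    """Repeatedly patch the model on any violated reference class until
    it is eps-consistent with every class in the collection."""
    for _ in range(max_rounds):
        i = find_violated_class(f_vals, F_vals, class_masks, eps)
        if i is None:
            break                        # eps-multicalibrated: done
        f_vals = patch_on_class(f_vals, F_vals, class_masks[i])
    return f_vals
```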

3.2 Resolving the reference class problem at scale

Suppose we were to find ourselves in the presence of a reference class problem at scale; that is, we have two models, $f_1$ and $f_2$, that disagree substantially and frequently, despite both being consistent with the collection ${\cal S}$ of reference classes that we have thought to check. To make this quantitative, suppose that for both $f \in \{f_1, f_2\}$ and every $S \in {\cal S}$,

$$\mathbb{E}[f(e) \mid e \in S] \approx_{\epsilon/2} \mathbb{E}[F(e) \mid e \in S].$$

And yet,

$$\Pr\left[f_1(e) \not\approx_\epsilon f_2(e)\right] \ge \epsilon.$$

The plan will be to construct a new reference class $S(f_1, f_2)$ such that

  1. $S(f_1, f_2)$ is substantially large, and

  2. at least one of $f_1$ or $f_2$ fails to be $\epsilon/2$-consistent with respect to $S(f_1, f_2)$.

If we can constructively find such a reference class, then not only have we falsified at least one of $f_1$ and $f_2$, but we can also add $S(f_1, f_2)$ to our set ${\cal S}$ and perform the update referred to in lemma 3.1. Because the set $S(f_1, f_2)$ was “substantially large,” this update will significantly reduce the Brier score of at least one of the two models, and so (again, because Brier scores cannot become negative), this process must converge quickly. But if we are always able to find such a reference class $S(f_1, f_2)$ given any two models that witness a reference class problem at scale with the parameters we have specified, then it must be that when the process halts, the reference class problem at scale has been resolved. Moreover, because every step of this process was accuracy-improving, all parties should prefer the models produced by this process to the models that were input into it: unless they survived to be output without modification, the input models were each falsified by some reference class in ${\cal S}$ (whereas the output models are consistent with all of these reference classes). The models that are output have a lower Brier score than all of the previous models produced by this sequence of updates, including the input models, and hence falsify all of the models that precede them.

The problem then reduces to finding, given two models $f_1, f_2$ that satisfy

$$\Pr\left[f_1(e) \not\approx_\epsilon f_2(e)\right] \ge \epsilon,$$

a reference class $S(f_1, f_2)$ that achieves the two desiderata from earlier. Here is the construction given by Roth et al. (2023) (a different construction used by Garg et al. [2019] would also work here). Define the “$\epsilon$-disagreement region” of the two models $D_\epsilon(f_1, f_2)$ to be the set of points $e \in {\cal X}$ such that the two models produce predictions that differ by at least $\epsilon$:

$$D_\epsilon(f_1, f_2) = \left\{ e \in {\cal X} : f_1(e) \not\approx_\epsilon f_2(e) \right\}.$$

By hypothesis, we know that $\Pr[e \in D_\epsilon(f_1, f_2)] \geq \epsilon$. Observe now that we can partition the disagreement region into two disjoint regions: those points on which $f_1$ makes a larger prediction than $f_2$ and those points on which $f_2$ makes a larger prediction than $f_1$:

$$D_\epsilon(f_1, f_2) = D_\epsilon^+(f_1, f_2) \cup D_\epsilon^-(f_1, f_2),$$

where

$$\begin{align} D_\epsilon^+(f_1, f_2) &= \{ e \in D_\epsilon(f_1, f_2) : f_1(e) \lt f_2(e) \} \qquad \text{and} \\ D_\epsilon^-(f_1, f_2) &= \{ e \in D_\epsilon(f_1, f_2) : f_1(e) \gt f_2(e) \}. \end{align}$$

We claim that at least one of $D_\epsilon ^ + \left( {{f_1},{f_2}} \right)$ or $D_\epsilon ^ - \left( {{f_1},{f_2}} \right)$ satisfies our desiderata.

  1. At least one of $D_\epsilon^+(f_1, f_2)$ and $D_\epsilon^-(f_1, f_2)$ must be “substantially large.” In particular, because $\Pr[e \in D_\epsilon(f_1, f_2)] \geq \epsilon$ and $D_\epsilon^+(f_1, f_2)$ and $D_\epsilon^-(f_1, f_2)$ form a partition of $D_\epsilon(f_1, f_2)$, we must have that for at least one set $D_\epsilon^\circ(f_1, f_2) \in \{ D_\epsilon^+(f_1, f_2), D_\epsilon^-(f_1, f_2) \}$,

    $$\Pr\left[e \in D_\epsilon^\circ(f_1, f_2)\right] \geq {\epsilon \over 2}.$$

  2. At least one of $f_1$ and $f_2$ fails to be $\epsilon/2$-consistent with respect to $D_\epsilon^\circ(f_1, f_2)$. This is because, by construction,

    $$\left| \mathbb{E}[f_1(e) \mid e \in D_\epsilon^\circ(f_1, f_2)] - \mathbb{E}[f_2(e) \mid e \in D_\epsilon^\circ(f_1, f_2)] \right| \gt \epsilon,$$

    and so whatever value $\mathbb{E}[F(e) \mid e \in D_\epsilon^\circ(f_1, f_2)]$ takes, we must have the following for at least one $f \in \{f_1, f_2\}$:

    $$\left| \mathbb{E}[f(e) \mid e \in D_\epsilon^\circ(f_1, f_2)] - \mathbb{E}[F(e) \mid e \in D_\epsilon^\circ(f_1, f_2)] \right| \gt \epsilon/2.$$

Thus, we can choose our reference class to be $S(f_1, f_2) = D_\epsilon^\circ(f_1, f_2)$ and be guaranteed that lemma 3.1 can be applied so as to decrease the Brier score of at least one of the two models by $\Omega(\epsilon^3)$. Therefore, after at most $O(1/\epsilon^3)$ iterations of this procedure, we have resolved any instance of the reference class problem at scale (up to parameter $\epsilon$). Roth et al. (2023) call this procedure “model reconciliation,” and the resulting theorem can be formalized as follows:

Theorem 3.1 (Roth et al. 2023). Fix any $\epsilon, \delta \gt 0$. There is an efficient algorithmic procedure (“Reconcile”) taking as input $O(\ln(1/\delta)/\epsilon^5)$ samples from the distribution that can guarantee the following. Given any two models $f_1, f_2$, Reconcile outputs a pair of models $f_1'$ and $f_2'$ such that with probability $1 - \delta$:

  1. $f_1'$ and $f_2'$ have Brier scores no higher than those of $f_1$ and $f_2$, respectively:

    $$B(f_1') \le B(f_1) \qquad \text{and} \qquad B(f_2') \le B(f_2),$$

    with the inequalities strict whenever $f_i' \ne f_i$, and

  2. $f_1'$ and $f_2'$ almost agree almost everywhere:

    $$\Pr\left[f_1'(e) \not\approx_\epsilon f_2'(e)\right] \le \epsilon.$$
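
The reconciliation loop behind this theorem can be sketched in a few lines, again in-sample and reusing the helpers from the earlier sketches; this is our own simplified illustration, which omits the confidence and sample-reuse bookkeeping that the formal guarantee requires.

```python
def reconcile(f1_vals, f2_vals, F_vals, eps, max_rounds=10_000):
    """In-sample sketch of reconciliation: while the models disagree by
    more than eps on at least an eps fraction of points, take the larger
    half of the disagreement region and patch whichever model(s) fail
    eps/2-consistency on it."""
    for _ in range(max_rounds):
        disagree = np.abs(f1_vals - f2_vals) > eps      # D_eps(f1, f2)
        if disagree.mean() < eps:
            break                                       # almost agree almost everywhere
        d_plus = disagree & (f1_vals < f2_vals)
        d_minus = disagree & (f1_vals > f2_vals)
        # At least one side of the partition has mass >= eps/2.
        region = d_plus if d_plus.mean() >= d_minus.mean() else d_minus
        if not is_eps_consistent(f1_vals, F_vals, region, eps / 2):
            f1_vals = patch_on_class(f1_vals, F_vals, region)
        if not is_eps_consistent(f2_vals, F_vals, region, eps / 2):
            f2_vals = patch_on_class(f2_vals, F_vals, region)
    return f1_vals, f2_vals
```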

A brief remark is in order to clarify the parameters of the theorem. A consequence of lemma 3.1 is that whenever we encounter a reference class that has a probability of at least $\epsilon$ on which a model fails to be $\epsilon$-consistent, we can update the model to reduce its squared error by at least $\Theta(\epsilon^3)$. Because the squared error is bounded between $0$ and $1$, in the worst case, it starts at $1$. By performing these updates, we can drive it down to $0$, which would require $O(1/\epsilon^3)$ such iterations (we cannot have more than this without driving the squared error to be negative, an impossibility). However, at each iteration, we need to be able to verify from samples that (with confidence $1 - \delta$) at least one of the constructed reference classes $D_\epsilon^+(f_1, f_2)$ and $D_\epsilon^-(f_1, f_2)$ has a probability mass of at least $\epsilon/2$. This requires $O(\log(1/\delta)/\epsilon^2)$ samples. Multiplying the two bounds gives the bound on the required number of samples in theorem 3.1. We have chosen to state a simple bound, although it is not the quantitatively tightest bound known. For example, if our initial models $f_1$ and $f_2$ are nontrivial, they will not have a maximal squared error to begin with: if the squared error of the worst of them is bounded by $E \lt 1$, that is, $\max(B(f_1), B(f_2)) \le E$, then the number of iterations improves to $O(E/\epsilon^3)$, and the data requirement bound improves to $O(E\ln(1/\delta)/\epsilon^5)$. If we do not naively take fresh samples at every iteration but reuse them using so-called “adaptive data analysis” techniques (Dwork et al. 2015; Bassily et al. 2016; Jung et al. 2021), as was originally done in the analysis of multicalibration by Hébert-Johnson et al. (2018), then the bound can be further improved to $O(\sqrt{E}\ln(1/\delta)/\epsilon^{3.5})$. We do not wish to focus on the precise dependencies in this bound but to emphasize that it scales only with the error parameters $\epsilon$ and $\delta$, not with the complexity of the prediction problem itself. For example, the number of samples needed is independent of how rich the feature space is, so, for example, we can represent individual people with representations $e$ consisting of every conceivable piece of information we can record about them without increasing our data requirements. Similarly, our initial models $f_1$ and $f_2$ can be arbitrarily sophisticated without increasing our data requirements—in fact, this will improve our data requirements to the extent that it decreases the squared error $E$ of our initial models.
To give some sense of the scale of the computation and data requirements, suppose that our initial models have squared error bounded by $E \le 0.1$ and that we run the algorithm from theorem 3.1 parameterized to guarantee that with 95% confidence ($\delta = 0.05$), the final pair of models will agree in their predicted individual probabilities up to $\pm 0.05$ on 95% of examples ($\epsilon = 0.05$). With these parameters, the number of rounds the algorithm must run before convergence scales as $E/\epsilon^3 = 800$—something that can be done in seconds on a modern computer—and the number of data points required scales as $\sqrt{E}\ln(1/\delta)/\epsilon^{3.5} \le 34{,}000$, a nontrivial but entirely reasonable number of samples in any large-scale prediction problem and orders of magnitude less than the numbers used to train modern neural network architectures.
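
For readers who wish to reproduce this arithmetic, a few lines suffice (the constants hidden by the asymptotic notation are ignored, exactly as in the text):

```python
import math

E, eps, delta = 0.1, 0.05, 0.05            # squared-error bound and parameters from the text

rounds = E / eps**3                         # iteration bound E / eps^3
samples = math.sqrt(E) * math.log(1 / delta) / eps**3.5   # adaptive-reuse sample bound

print(f"rounds  ~ {rounds:.0f}")            # ~ 800
print(f"samples ~ {samples:,.0f}")          # ~ 34,000
```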

Thus, we see that scale actually mitigates the reference class problem: although it does not (and cannot) eliminate the individual reference class problem, given modest amounts of data, we cannot have multiple models mapping individuals to predictions that are both equally consistent with the data and make substantially different predictions on a substantial number of individuals. This is because, given two such substantially different models, we have a constructive procedure that is guaranteed to falsify and improve at least one of the two models. Moreover, the quantity $\epsilon$ parameterizing the word “substantially” can be driven toward zero by collecting an amount of data scaling polynomially with $1/\epsilon$. Data, therefore, can be used to quantitatively mitigate the reference class problem at scale in a way that data fundamentally cannot be used to mitigate the individual reference class problem. We remark that this theorem does not imply that models trained using standard methods might not disagree substantially on many predictions—indeed, this is known to occur (Marx et al. 2020). Rather, what theorem 3.1 implies is that given two such disagreeing models, there is a lightweight procedure (that nevertheless requires a modest amount of additional training data and computing) that can resolve the disagreements in an accuracy-improving way. To return to the health-care example we used to introduce the reference class problem at scale: if the health-care system finds itself in possession of two models for predicting hospital readmission risk that substantially disagree, rather than choosing (arbitrarily) to act on one of them rather than the other, it can apply the model reconciliation procedure to statistically falsify one or both of the models and find more accurate models that rarely disagree. If it finds itself in this position once again in the future, it can iterate the procedure and, therefore, never find itself needing to choose between two equally accurate and well-justified models that nevertheless suggest substantially different downstream actions.

4 Conclusion

In this article, we have drawn a distinction between the classical (“individual”) reference class problem, which concerns single predictions, and the reference class problem at scale, which concerns models that systematically map data to predictions. The atomic object of the individual reference class problem is a single prediction; the atomic object of the reference class problem at scale is a model. In both cases, a reference class problem comes about if we have two conflicting atomic objects (substantially different predictions/substantially different models) that are equally consistent with the data before us. Despite the fact that a model is simply a large collection of predictions, and the reference class problem can arise for each of the predictions individually, we have shown that the problem cannot compound across these predictions. So, the reference class problem at scale cannot occur to a substantial quantitative degree when data are prevalent. This means that in scenarios in which we systematically make many predictions (insurance, medicine, etc.), the reference class problem may have limited bite. We simply cannot find ourselves in a situation in which we have multiple models that substantially disagree on many predictions and are unable to adjudicate between them using purely statistical means.

Acknowledgments

We thank Scott Weinstein, Thomas Grote, and Kasey Genin for their helpful discussions and Deborah Mayo for first remarking on the similarity between multicalibration and classical work on the reference class problem. A. R. is supported in part by the Hans Sigrist Prize, the Simons Collaboration on the Theory of Algorithmic Fairness, and National Science Foundation (NSF) grants FAI-2147212 and CCF-2217058.

Footnotes

1 Salmon (1977) specifies that reference classes should be “objectively homogeneous”—that is, chosen such that no further partitioning of the reference class results in a different conditional probability for the event in question compared with its marginal probability over the whole reference class. But because this is a requirement over all possible partitionings of the data, it is not something that can be found (or even verified) using finite data and computation—a difficulty shared with collective-based formalizations of randomness (Martin-Löf 1966) and probability (Dawid 1985).

2 If we desire, we may also view $F$ as being a mapping $F:{\cal X} \to {\rm{\Delta }}\left\{ {0,1} \right\}$ from universe elements to random variables supported on $\left\{ {0,1} \right\}$ , allowing that the outcome in question might still be stochastic even conditioning on $e$ . This might be the case if, for example, $F\left( e \right)$ represents some future (as-yet-unrealized) outcome for an individual $e$ .

3 Because we allow that $F\left( e \right)$ may be a random variable, we allow (but do not require) that even fixing $e$ , ${\rm{Pr}}[F\left( e \right)|e]$ may take values strictly between $0$ and $1$ , in a way that could be consistent with, for example, probabilistic evolution of the universe, even fixing all possible current observations. For example, we allow that it might be coherent to speak of a 40% chance of rain tomorrow even when conditioning on all possible current observations.

4 Buchholz (2023) has argued that in certain cases, the ability of neural networks to learn in high-dimensional spaces without overfitting suggests otherwise. We disagree with this assertion—in fact, the empirical finding that different, equally accurate neural networks can produce frequently disagreeing predictions (Marx et al. 2020; Black et al. 2022) is strong evidence against it.

5 The idea that we should choose the “most specific” reference class about which we have reliable data goes back to Reichenbach (1971).

6 We here speak of “statistical falsification” in the sense of hypothesis testing, not in the sense of logical falsification. Informally, a model is statistically falsified if we can reject the null hypothesis that it is producing forecasts of “true individual probabilities”—or that it is simultaneously consistent with every possible reference class in the sense of definition 3.1 at some level of confidence.

7 Once again, if we like, we can take $F:{\cal X} \to {\rm{\Delta}}\{0,1\}$ to map universe elements to random variables supported on $\{0,1\}$, allowing the outcome $F(e)$ to be random, even conditional on $e$.

8 Indeed, not just finite amounts of data but also amounts of data and computation that we can control using modestly growing functions of the parameter $\epsilon $ governing calibration error.

9 Here, in writing ${\rm{\Theta }}\left( {{\epsilon ^2}{\mu _S}} \right)$ , we are using the asymptotic notation common in mathematical statistics and computer science. In this usage, it is merely simplifying the expression by hiding constants—specifying that there exist positive constants ${c_1},{c_2}$ such that for sufficiently small values of both $\epsilon $ and ${\mu _S}$ , we have that ${c_1}{\epsilon ^2}{\mu _S} \le B\left( f \right) - B\left( {f{\rm{'}}} \right) \le {c_2}{\epsilon ^2}{\mu _S}$ . Writing $O\left( \cdot \right)$ rather than ${\rm{\Theta }}\left( \cdot \right)$ indicates the upper bound without the lower bound.

10 An expression we first heard from Cynthia Dwork.

References

Bassily, Raef, Nissim, Kobbi, Smith, Adam, Steinke, Thomas, Stemmer, Uri, and Ullman, Jonathan. 2016. “Algorithmic Stability for Adaptive Data Analysis.” In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, 1046–59. New York: Association for Computing Machinery. https://doi.org/10.1145/2897518.2897566.
Black, Emily, Raghavan, Manish, and Barocas, Solon. 2022. “Model Multiplicity: Opportunities, Concerns, and Solutions.” In FAccT ’22: 2022 ACM Conference on Fairness, Accountability, and Transparency, 850–63. New York: Association for Computing Machinery. https://doi.org/10.1145/3531146.3533149.
Bolinger, Renée Jorgensen. 2020. “The Rational Impermissibility of Accepting (Some) Racial Generalizations.” Synthese 197 (6):2415–31. https://doi.org/10.1007/s11229-018-1809-5.
Buchholz, Oliver. 2023. “The Deep Neural Network Approach to the Reference Class Problem.” Synthese 201 (3):111. https://doi.org/10.1007/s11229-023-04110-9.
Cohen, L. Jonathan. 1977. The Probable and the Provable. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780198244127.001.0001.
Dawid, A. P. 1985. “Calibration-Based Empirical Probability.” Annals of Statistics 13 (4):1251–74. https://doi.org/10.1214/aos/1176349736.
Dawid, Philip. 2017. “On Individual Risk.” Synthese 194 (9):3445–74. https://doi.org/10.1007/s11229-015-0953-4.
Dwork, Cynthia, Feldman, Vitaly, Hardt, Moritz, Pitassi, Toniann, Reingold, Omer, and Roth, Aaron Leon. 2015. “Preserving Statistical Validity in Adaptive Data Analysis.” In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, 117–26. New York: Association for Computing Machinery. https://doi.org/10.1145/2746539.2746580.
Dwork, Cynthia, Kim, Michael P., Reingold, Omer, Rothblum, Guy N., and Yona, Gal. 2021. “Outcome Indistinguishability.” In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2021, 1095–108. New York: Association for Computing Machinery. https://doi.org/10.1145/3406325.3451064.
Franklin, John Hope. 2005. Mirror to America: The Autobiography of John Hope Franklin. New York: Macmillan. ISBN 9780374530471.
Gardiner, Georgi. 2018. “Evidentialism and Moral Encroachment.” In Believing in Accordance with the Evidence: New Essays on Evidentialism, edited by McCaine, Kevin, 169–95. New York: Springer. https://doi.org/10.1007/978-3-319-95993-1.
Garg, Sumegha, Kim, Michael P., and Reingold, Omer. 2019. “Tracking and Improving Information in the Service of Fairness.” In Proceedings of the 2019 ACM Conference on Economics and Computation, 809–24. New York: Association for Computing Machinery. https://doi.org/10.1145/3328526.3329624.
Hájek, Alan. 2007. “The Reference Class Problem Is Your Problem Too.” Synthese 156 (3):563–85. https://doi.org/10.1007/s11229-006-9138-5.
Hébert-Johnson, Ursula, Kim, Michael, Reingold, Omer, and Rothblum, Guy. 2018. “Multicalibration: Calibration for the (Computationally-Identifiable) Masses.” In Proceedings of the 35th International Conference on Machine Learning, edited by Dy, Jennifer and Krause, Andreas, 1939–48. New York: PMLR. https://proceedings.mlr.press/v80/hebert-johnson18a.html.
Jung, Christopher, Ligett, Katrina, Neel, Seth, Roth, Aaron, Sharifi-Malvajerdi, Saeed, and Shenfeld, Moshe. 2021. “A New Analysis of Differential Privacy’s Generalization Guarantees (Invited Paper).” In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2021, 9. New York: Association for Computing Machinery. https://doi.org/10.1145/3406325.3465358.
Martin-Löf, Per. 1966. “The Definition of Random Sequences.” Information and Control 9 (6):602–19. https://doi.org/10.1016/S0019-9958(66)80018-9.
Marx, Charles, Calmon, Flavio, and Ustun, Berk. 2020. “Predictive Multiplicity in Classification.” In ICML’20: Proceedings of the 37th International Conference on Machine Learning, edited by Daumé, Hal and Singh, Aarti, 6765–74. New York: PMLR.
Reichenbach, Hans. 1971. The Theory of Probability. 2nd edition. Berkeley, CA: University of California Press. ISBN 978-0520019294.
Rhee, Robert J. 2007. “Probability, Policy and the Problem of Reference Class.” International Journal of Evidence & Proof 11 (4):286–91. https://doi.org/10.1350/ijep.2007.11.4.286.
Roth, Aaron. 2022. “Uncertain: Modern Topics in Uncertainty Estimation.” https://www.cis.upenn.edu/aaroth/uncertainty-notes.pdf.
Roth, Aaron, Tolbert, Alexander, and Weinstein, Scott. 2023. “Reconciling Individual Probability Forecasts.” In FAccT ’23: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 101–10. New York: Association for Computing Machinery. https://doi.org/10.1145/3593013.3593980.
Salmon, Wesley C. 1977. “Objectively Homogeneous Reference Classes.” Synthese 36 (4):399–414. https://doi.org/10.1007/BF00486104.
Venn, John. 1866. The Logic of Chance: An Essay on the Foundations and Province of the Theory of Probability, with Especial Reference to its Logical Bearings and its Application to Moral and Social Science. London: Macmillan. ISBN 9783337474676.
Von Mises, Richard. 1981. Probability, Statistics, and Truth. Dover Books on Mathematics. North Chelmsford, MA: Courier Corporation. ISBN 9780486242149.