
An Expert-Sourced Measure of Judicial Ideology

Published online by Cambridge University Press:  15 August 2025

Kevin L. Cope*
Affiliation:
Professor of Law and Public Policy, Professor of Politics (by courtesy), University of Virginia, 580 Massie Rd., Charlottesville, VA 22903, USA.

Abstract

This article develops the first dynamic method for systematically estimating the ideologies and other traits of nearly the entire federal judiciary. The Jurist-Derived Judicial Ideology Scores (JuDJIS) method derives from computational text analysis of over 20,000 written evaluations by a representative sample of tens of thousands of jurists as part of an ongoing, systematic survey initiative begun in 1985. The resulting data constitute not only the first such comprehensive federal-court measure that is dynamic, but also the only such measure that is based on judging, and the only such measure that is potentially multi-dimensional. The results of empirical validity tests reflect these advantages. Validation on a set of several thousand appellate decisions indicates that the ideology estimates predict outcomes significantly more accurately than the existing appellate measures, such as the Judicial Common Space. In addition to informing theoretical debates about the nature of judicial ideology and decision-making, the JuDJIS initiative might lead courts scholars to revisit some of the lower-court research findings of the last two decades, which are generally based on static, non-judicial models. Perhaps most importantly, this method could foster breakthroughs in courts research that, until now, were impossible due to data limitations.

Information

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of The Society for Political Methodology

For decades courts scholars have sought to explain systematically why judges decide cases as they do, often by estimating quantitatively the ideologies of judges (see, e.g., Bailey Reference Bailey2017; Bonica and Sen Reference Bonica and Sen2021; Rohde and Spaeth Reference Rohde and Spaeth1976; Segal and Cover Reference Segal and Cover1989). Broadly speaking, the methods for doing so comprise three types: coding and counting judicial votes; using the preferences of related political actors as proxies for the judge’s ideology; and drawing on third-party commentaries of the judges (Cope Reference Cope, Epstein, Grendstad, Šadl and Weinshall2024). Together, the data produced by these measures over the last few decades have opened the door to a new line of research, which now forms a significant component of judicial politics scholarship.Footnote 1 As with any measure, each of these methods has unique strengths and limitations (see Bonica et al. Reference Bonica, Chilton, Goldin, Rozema and Sen2017). As to limitations: many metrics are highly attenuated from the concept they purport to measure; most cannot capture more than one dimension; most cannot capture change over time; many fail to distinguish judicial from political ideology; and some capture only a fraction of the judges within a given court. Recognizing some of these limits of existing methods, Epstein, Martin, and Quinn (Reference Epstein, Martin and Quinn2024) recently noted the lack of any comprehensive judicial ideology metric derived from expert experience and called for a new research agenda along those lines.

This article introduces such an initiative: the Jurist-Derived Judicial Ideology Scores (JuDJIS, pronounced “judges”), an expert-sourced approach to measuring judicial traits, which attempts to overcome each of the limitations above. Many legal and political scholars have recognized the challenge of capturing judicial ideology, an inherently subjective, multi-dimensional concept, with voting or political behavior data alone. In that vein, my starting point and core assumption is that legal practitioners and other experts have special insight into how judges decide cases, insight not fully captured by either the often-quite-partisan judicial-appointment process, or judges’ own political, electoral, or managerial (as opposed to judicial) behavior. By assuming that expert judgments can be a valid basis for measuring hard-to-quantify political phenomena, JuDJIS emulates prominent datasets in other fields such as Varieties of Democracy (V-DEM) (Coppedge et al. Reference Coppedge2011), Economic Freedom of the World (Gwartney et al. Reference Gwartney, Lawson, Hall and Murphy2021), the CIRI Human Rights Database (Cingranelli and Richards Reference Cingranelli and Richards2010), and the Chapel Hill Expert Survey (Jolly et al. Reference Jolly2022).

The JuDJIS scores are estimated using a new text-analysis technique that quantifies tens of thousands of written evaluations, systematically solicited from a broad, representative sample of judicial experts. The information is collected by professional survey firms commissioned by the Almanac of the Federal Judiciary, a tri-annually published initiative which, since 1985, has continually surveyed a stratified sample of qualified experts for each judge in the federal judiciary. In response to prompts, the experts use their own words to evaluate the judges in five categories: ability; demeanor; trial practice/oral argument; settlement/opinion quality; and ideology. The evaluations are provided every few years by a set of eight-to-ten lawyers and ex-judicial clerks, each with significant professional experience with the judicial decisions and courtroom behavior of the judge(s) he or she is evaluating. Thus, an established judge would be expected to be evaluated by a representative sample of approximately 16–30 different lawyers over a ten-year period. The Almanac’s 41 years of written evaluations to date comprise over 14,500 documents and 11 million words, with updated volumes released every few months. The corpus comprising this complete set of Almanac volumes was quantified to produce scaled estimates of ideology and other judicial traits.Footnote 2

The JuDJIS method has several important advantages over existing non-Supreme Court measures of judicial ideology. It is the only such comprehensive measure: (1) that is based on judging (rather than, say, campaign contributions or congressional votes); or (2) that allows for change over time; or (3) that can produce scores comprising multiple dimensions, covering several non-ideology judicial traits (to be explored in future work). Its eventual scope—essentially all Article III lower-court judges, over 2,200 in total as of 2024—is larger than any existing set of scores. Perhaps most importantly, the JuDJIS circuit ideology data predict the outcomes of a representative set of case decisions with significantly greater accuracy than any of the three leading circuit-judge ideology measures.

This article contributes to the field of judicial behavior in three key ways. First, by developing a behavioral model of judicial ideology and showing that the judgments of legal experts can outperform political metrics in predicting case outcomes, it informs ongoing theoretical debates about the nature of judicial ideology and decision-making (see, e.g., Bonica and Sen Reference Bonica and Sen2021; Converse Reference Converse2006; Fischman and Law Reference Fischman and Law2009; Gerring Reference Gerring1997; Lammon Reference Lammon2009), not only in the United States but also in the courts of other countries and international bodies. Second, I hope the JuDJIS method—being judging-based, dynamic, comprehensive, and multi-dimensional—will spark a new line of judicial behavior research, by allowing researchers to raise and analyze important questions in judicial politics that have thus far been intractable due to data limitations. It therefore contributes to a line of path-breaking innovations in measuring judicial ideology, such as methods developed by Segal and Cover (Reference Segal and Cover1989), Martin and Quinn (Reference Martin and Quinn2002), and Epstein et al. (Reference Epstein, Martin, Segal and Westerland2007). And analogous to Martin and Quinn (Reference Martin and Quinn2002)'s observation in the context of developing the first dynamic model of the Supreme Court, a dynamic model of the lower courts may also call into question some previous circuit and district court research, which is based almost entirely on static models. Finally, because it applies political-science scaling methods to content derived from doctrinal-legalist perspectives, I hope the JuDJIS method will help to further bridge the theoretical and methodological gulfs that still divide these disciplines.

Section 1 presents the behavioral model and theoretical assumptions motivating the method. Section 2 explains the hierarchical ngram measurement method used to generate the scores. Section 3 empirically validates the method. Section 4 presents the scores for U.S. circuit court judge ideologies, 1990–2017. Section 5 concludes.

1 Measuring Judicial Ideology

Since the early-to-mid twentieth century, researchers have been attempting to measure judicial behavior quantitatively (e.g., Gaudet Reference Gaudet and St. John1933; Nagel Reference Nagel1961; Schubert Reference Schubert1960). In an attempt to predict and explain judges’ rulings, social scientists, inspired partly by insights from the attitudinal model of judging, have developed a variety of quantitative measures that purport to capture judges’ ideology.Footnote 3

Those existing methods can be divided into three categories: vote counting, proxy, and third-party (cf. Cope Reference Cope, Epstein, Grendstad, Šadl and Weinshall2024; Fischman and Law Reference Fischman and Law2009). Vote-counting entails estimating judicial ideology from judges’ preferred case outcomes as expressed with their votes: either guided (in which researchers attribute substantive values to judges’ votes) (e.g., Spaeth et al. Reference Spaeth, Epstein, Ruger, Whittington, Segal and Martin2014) or agnostic (recording only whether a judge voted with the majority or minority) (e.g., Martin and Quinn Reference Martin and Quinn2002; Voeten Reference Voeten2007; Windett, Harden, and Hall Reference Windett, Harden and Hall2015).

Proxy measures draw on observable traits that are theoretically related to judicial ideology but conceptually distinct from it (Cope Reference Cope, Epstein, Grendstad, Šadl and Weinshall2024). The proxy is selected because the researcher believes it is empirically correlated with the latent trait of judicial ideology. Early proxy methods included judges’ self-identified party affiliations (Nagel Reference Nagel1961; Schubert Reference Schubert1960). Recent proxy methods involve selecting a political actor linked to the judge—often the appointing party or executive—and measuring that actor’s political ideology as a stand-in for the judge’s judicial ideology.

More complex proxy measures incorporate the ideal points of other political actors. They include the Common Space scores (Giles, Hettinger, and Peppers Reference Giles, Hettinger and Peppers2001) and Judicial Common Space (JCS), which rely on the congressional-voting-based NOMINATE ideal points (Poole and Rosenthal Reference Poole and Rosenthal2000) of the U.S. senators involved in a judge’s nomination (Epstein, Walker, and Dixon Reference Epstein, Walker and Dixon1989), taking advantage of the Senate “blue-slip” custom, in which the president and Senate consult the senators from the state of the position being filled about candidates (McMillion Reference McMillion2017). Specifically, the JCS considers the NOMINATE score of the judicial vacancy state’s U.S. senator who shares the appointing president’s party, and it attributes the senator’s score to that judge. Boyd (Reference Boyd2011) extends the JCS method to district courts, creating a data set of district judge ideology. Another proxy measure, the Clerk-Based Ideology (CBI) scores (Bonica et al. Reference Bonica, Chilton, Goldin, Rozema and Sen2017), first estimates the political ideology of U.S. federal law clerks based on their campaign contributions. Based on the assumption that judges wish to hire clerks who share their ideologies, CBI uses the mean score of a judge’s clerks as a proxy for the judge’s own judicial ideology. Similarly, Bonica and Sen (Reference Bonica and Sen2017) use the campaign contributions that judges themselves make to candidates for political office (typically, before the judges took the bench) as a proxy for the judges’ judicial ideology.

Third-party measures consist of observers’ judgments or predictions about a judge’s ideology (Cope Reference Cope, Epstein, Grendstad, Šadl and Weinshall2024). Whereas proxy measures derive from the personal political behavior of actors in some way connected to the judge, third-party measures purport to observe and evaluate judicial ideology itself (past or anticipated). There are two main types of third-party measures: editorial-based and expert-based. The leading editorial-based third-party measure, Segal and Cover’s (Reference Segal and Cover1989) measure of U.S. Supreme Court justices’ ideology, relies on human-coded text analysis of media editorials written about judicial nominees after their nomination but before confirmation.

Third-party expert-based measures are an especially promising but as-yet barely explored third-party method (Grendstad, Shaffer, and Waltenburg Reference Grendstad, Shaffer and Waltenburg2012; Wijtvliet and Dyevre Reference Wijtvliet and Dyevre2021; Cope Reference Cope, Epstein, Grendstad, Šadl and Weinshall2024). Expert-based measures use the opinions of legal experts to create quantitative estimates of judicial ideology and other traits. Like other methods, this approach presents several logistical challenges, but it also has many attractive qualities that might overcome some of the shortcomings of existing approaches. Until now, however, such measures have seen no known application at large scale, or using computational text analysis, or to any U.S. court.

1.1 An Expert-Sourced Measure of U.S. Federal Judges

The JuDJIS method introduced here incorporates a technique that, to my knowledge, has not been used to measure ideology systematically in any field: computational text analysis of expert evaluations. The method involves a novel text-analysis technique, quantifying to date over three decades of evaluations by tens of thousands of legal experts, eventually covering over 2,200 judges. The evaluations are compiled by academic publisher Wolters Kluwer’s Almanac of the Federal Judiciary and are provided by lawyers and ex-law clerks—in their own words in response to prompts—with professional familiarity with each of the judges, including the judges’ written opinions, judging styles, and courtroom/chambers demeanors (Rosen Reference Rosen2023). In what follows, I set forth the assumptions underlying JuDJIS’s behavioral model of measuring judicial behavior.

1.2 Behavioral Model and Theoretical Assumptions

The choice to measure ideology from written expert evaluations makes two primary assumptions. First, ideology is a latent trait and therefore not directly observable (Fischman and Law Reference Fischman and Law2009). But I believe that a judge’s “normative beliefs about the appropriate functions of law and courts” (Cope Reference Cope, Epstein, Grendstad, Šadl and Weinshall2024) directly cause him or her to apply the law in certain ways in written opinions and oral decisions. This judicial behavior—e.g., written orders and opinions, private settlement discussions in chambers, and comments made from the bench—is therefore a direct manifestation of the judge’s ideology, and it is the closest thing to ideology itself that can be observed. The JuDJIS methodology is therefore unique among comprehensive lower-court measures, in that it captures expressed ideology, rather than others’ expectations or predictions of ideology.

Second, judicial “votes” are undeniably a key type of judicial behavior and highly probative of judicial ideology. But a judge’s judicial reasoning in her written opinions, orders, and analysis from the bench—which the Almanac evaluations capture indirectly via the lawyers’ observations of those behaviors—provides clearer insight into her judicial ideology than her decision to simply affirm or reverse.

For these reasons, a well-designed expert-sourced method based on observed judicial behavior should be able to come reasonably close to observing latent judging traits, including judicial ideology. At a minimum, expert-sourced evaluations are likely to relate more closely to ideology than the observations underlying most or all current judicial ideology measurement approaches, for instance, pre-confirmation speculation about future behavior (Segal and Cover Reference Segal and Cover1989), the campaign contributions of law clerks (Bonica et al. Reference Bonica, Chilton, Goldin, Rozema and Sen2017), judges’ pre-confirmation political campaign contributions (Bonica and Sen Reference Bonica and Sen2017), the appointing president’s party (Nagel Reference Nagel1961; Schubert Reference Schubert1960), the congressional voting records of pre-nomination supporting senators (Epstein et al. Reference Epstein, Martin, Segal and Westerland2007), and perhaps even case votes (Martin and Quinn Reference Martin and Quinn2002; Spaeth et al. Reference Spaeth, Epstein, Ruger, Whittington, Segal and Martin2014).

2 Hierarchical Ngram Text Analysis

I next describe the process for collecting the underlying source material and generating the JuDJIS data.

2.1 The Almanac

The Almanac has been published by Wolters Kluwer Publishing since 1985, with robust lawyers’ evaluations collected by professional surveyors since 1990. It contains detailed biographical data and subjective evaluation entries on all judges in the federal judiciary (including senior judges, as well as bankruptcy, magistrate, and other Article I judges). The judges’ entries comprise biographical information, key cases, and, most important for these purposes, the exclusive-to-the-Almanac lawyers’ evaluations. In response to interviewer prompts, a set of lawyers evaluates different aspects of each judge using the lawyers’ own words. The Almanac routinely updates the judges’ entries, with all judges within a given district or circuit updated in a single batch. The responses form the corpus for the data set.

Wolters Kluwer contracts with professional survey firms to conduct the lawyers’ evaluations. For each judge, they seek a “cross-section of the legal community” with substantial and recent familiarity with that judge (Rosen Reference Rosen2023, Introduction 1). All evaluators are guaranteed anonymity to promote candor. The surveyors attempt to represent lawyers in different sectors, that is, “criminal defense lawyers, plaintiffs’ tort lawyers, defendants’ tort lawyers, commercial lawyers, civil rights lawyers, etc.,” as well as former law clerks (Rosen Reference Rosen2023, Introduction 1).

The strategy to identify the appropriate sample of lawyers is tailored to the particular jurisdiction in question, as different types of districts (e.g., rural/urban, Northeast/South) have different dynamics between and within the federal courts and bar, but the overarching goal for every court is to achieve a representative sample of those familiar with the judges of the court (see Rosen Reference Rosen2023). Indeed, the business model of the for-profit publication depends on its reputation for accuracy, incentivizing it to present valid and unbiased evaluations.

The surveyors identify lawyers through a variety of means, including official court records of appearances and third-party publications listing prominent lawyers in the district or circuit. In general, the lawyers interviewed have personally appeared multiple times over the previous few years before the judge in question. For judges who have served for several years, the surveyors interview eight-to-ten lawyers per survey. In general, all lawyer comments are published, often abridged for the most relevant and substantive language. Over a typical 10-year period, approximately 16 to 30 different lawyers give comments on any established judge. The Supplementary Material gives examples of typical evaluations.

2.2 Text Analysis: Hierarchical Ngram Analysis

To analyze the corpus, I use a novel text-analysis method that I introduce here: hierarchical ngrams. The method has several advantages—particularly for this type of corpus and research objective—over large-language model machine-learning approaches, most notably, its transparency, explainability, and replicability (see, e.g., Albaugh et al. Reference Albaugh, Sevenans, Soroka and Loewen2013; Albaugh et al. Reference Albaugh, Soroka, Joly, Loewen, Sevenans and Walgrave2014; Grimmer and Stewart Reference Grimmer and Stewart2013). It also overcomes some shortcomings of conventional lexicon-based methods like unigrams or bigrams, such as their inability to draw meaning from syntactic context (cf. Farah and Kakisim Reference Farah and Kakisim2023). The hierarchical ngram method also performs comparably to large-language models in predicting case outcomes (but with the critical benefits of full transparency and replicability). See Section A2 of the Supplementary Material for analyses.

The hierarchical ngram method involves the following steps. First, every ngram from length 2 (bigram) to length 9 (novagram) in the corpus was identified. For the Circuit Ideology dataset, this initial process generated a set of 4,791 unique ngrams. For each of the eight ngram lengths, a threshold frequency was determined for inclusion in the coded dictionary, without regard to content or meaning. The threshold level was determined after examining the relative amount of information contained in each set of ngrams, balancing the conflicting objectives of the greatest possible context and maintaining a dictionary of manageable size.Footnote 4 For the Circuit Ideology dataset, the resulting dictionary comprises 2,176 unique ngrams.
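To make this step concrete, here is a minimal Python sketch of the dictionary-construction stage. The corpus format, tokenization, and function names are illustrative assumptions; only the ngram length range (2–9) and the frequency thresholds (Footnote 4) come from the text.

```python
# Minimal sketch of dictionary construction, assuming `comments` is a
# list of tokenized expert comments (lists of lowercased tokens).
from collections import Counter

# Length-specific frequency thresholds for the Circuit Ideology
# dictionary (Footnote 4).
THRESHOLDS = {2: 45, 3: 25, 4: 10, 5: 4, 6: 4, 7: 4, 8: 4, 9: 4}

def extract_ngrams(tokens, n):
    """All contiguous ngrams of length n in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_candidate_dictionary(comments):
    """Count every ngram of length 2-9 in the corpus and keep those
    meeting the threshold for their length, regardless of content."""
    counts = Counter()
    for tokens in comments:
        for n in THRESHOLDS:
            counts.update(extract_ngrams(tokens, n))
    return {gram for gram, c in counts.items() if c >= THRESHOLDS[len(gram)]}
```

The surviving ngrams would then be passed to the human coders described next.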

Second, a set of trained coders—each an upper-level J.D. student with a background in federal courts—assigned a value, ranging from $-3$ (extremely liberal) to $3$ (extremely conservative), or 99 (no ideological salience), to each ngram. For other datasets, appropriate alternative values are used. All ngrams were coded by three coders using the codebook contained in the Supplementary Material. Intercoder agreement was high;Footnote 5 I resolved discrepancies. A majority of ngrams were coded as non-salient, with 45.4% of ngrams assigned a value of $-3$ to $3$. The $[-3,3]$ scale was then converted to a $[-1,1]$ scale, as with some existing ideology measures, such as Martin and Quinn (Reference Martin and Quinn2002) and JCS.

Next, the dictionary values were assigned to the judges in the corpus. As Figure 1 shows, a judge evaluation (j) comprises one or more comments (each by a single expert reviewer, on a given topic of a given judge in a given year). Each such comment comprises one or more ngrams (g) of length 2–9.

Figure 1 Structure of document components (evaluation-comment-ngram).

To assign the dictionary values to the corpus, I use the hierarchical ngram algorithm described below and in the Supplementary Material. Intuitively, for each comment, the algorithm identifies all ngrams in the comment that have been assigned a score in the dictionary. There are often multiple non-overlapping ngrams in a given comment that each receive a score. For example, in the following comment, <In employment cases, she’s not really pro-employee, but not really pro-business either; she’s usually right down the middle >, each of the three underlined ngrams would be scored.

Not all ngrams are counted, however, as many overlap with, or are nested within, other ngrams. For instance, the comment <she’s not a particularly harsh sentencer in drug cases> contains the ngrams <harsh sentencer>, <particularly harsh sentencer>, and <not a particularly harsh sentencer>. Each of these ngrams obviously has a different meaning. In order to avoid counting and tallying redundant—or, worse, conflicting—scores of overlapping or nested ngrams, the algorithm recognizes a hierarchy of ngrams based on length: only the senior ngram (i.e., the longest in any set of nested ngrams) is considered and scored.

(1) $$ \begin{align}i_c = \frac{ \sum_{g=1}^{G_c} i_g}{G_c} \end{align} $$
(2) $$ \begin{align} i_j = \frac{\sum_{c=1}^{C_j} i_c}{C_j} \end{align} $$

After the hierarchical process identifies the ngrams to be scored, a judge evaluation is calculated in the following way. First, the ideology of a given comment, $i_c$, is defined by Equation (1), where a comment, c, comprises $G_c$ ngrams, and each ngram is assigned an ideology, $i_g$. Thus, $i_c$ is the mean ideology score of the $G_c$ ngrams that make up comment c. In turn, each judge, j, for a given year or set of years has an observed ideology, $i_j$, defined as in Equation (2), where the judge’s evaluation(s), j, comprises $C_j$ comments. Thus, $i_j$ is the mean ideology score of the $C_j$ comments that make up the judge evaluation(s).Footnote 6
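A short sketch may help fix ideas. The following Python code implements the senior-ngram rule and Equations (1) and (2) as described above; the dictionary format and the exact order of overlap resolution are my assumptions (the full algorithm appears in the Supplementary Material).

```python
# Sketch of hierarchical scoring, assuming `dictionary` maps ngram
# tuples to ideology scores on [-1, 1].

def score_comment(tokens, dictionary):
    """Equation (1): mean score of the senior (longest,
    non-overlapping) dictionary ngrams in one comment."""
    scores, covered = [], set()
    for n in range(9, 1, -1):                    # longest ngrams first
        for i in range(len(tokens) - n + 1):
            span = set(range(i, i + n))
            gram = tuple(tokens[i:i + n])
            if gram in dictionary and not (span & covered):
                scores.append(dictionary[gram])  # count senior ngram only
                covered |= span                  # block nested ngrams
    return sum(scores) / len(scores) if scores else None

def score_judge(comments, dictionary):
    """Equation (2): mean over comment scores, so each evaluator's
    comment carries equal weight regardless of length (Footnote 6)."""
    cs = [s for s in (score_comment(c, dictionary) for c in comments)
          if s is not None]
    return sum(cs) / len(cs) if cs else None
```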

To illustrate how this process can produce scores for particular judges, Figure 2 shows the distribution of comment scores for eight circuit judges: four Democratic-appointed (top row) and four Republican-appointed (bottom row). Each judge differs from the others in both mean ideology score and in distribution and variance. These histograms thus illustrate how the measure can be sensitive to small true variation between individual judges, even those appointed by the same president at about the same time.

Figure 2 Distribution of JuDJIS ideology comment scores for sample of eight circuit judges, aggregated over their tenures. Note: Vertical dashed lines denote the mean score. Gray bands denote 90% confidence intervals around the mean score.

For instance, while comments on Judge Stephen Reinhardt’s ideology were not uniform, the majority of comments placed him in the liberal range ($-0.5$ to $-0.8$). This relative agreement, combined with the sheer number of comments amassed over his long career on the bench, contributes to fairly high confidence (and therefore, narrow confidence intervals, as indicated by the gray vertical band) about his mean score (indicated by the vertical dashed lines). In contrast, comments on Judge Richard Posner take a bimodal distribution, with most evaluators split between labeling him moderate ($-0.2$ to $0.2$) and conservative ($0.5$ to $0.8$). Perhaps this division occurs because Judge Posner is often characterized as more libertarian than conservative (Harcourt Reference Harcourt2007), meaning his somewhat left-leaning views on some social issues made it difficult to reach a consensus on his placement on a single-dimensional, left-right scale. The resulting uncertainty about his mean score is somewhat larger than that for Judge Reinhardt.
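For intuition about the uncertainty bands in Figure 2, a normal-approximation confidence interval over a judge's comment scores is one plausible construction; the article's exact interval procedure is not specified here, so treat this sketch as an assumption.

```python
# Hypothetical 90% confidence interval around a judge's mean comment
# score; it narrows with more comments and with greater agreement.
import statistics

def mean_with_ci(comment_scores, z=1.645):     # z for a 90% interval
    m = statistics.mean(comment_scores)        # needs >= 2 comments
    se = statistics.stdev(comment_scores) / len(comment_scores) ** 0.5
    return m, (m - z * se, m + z * se)
```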

2.3 Advantages and Potential Critiques

Before empirically validating the method, I explore several of the method’s a priori advantages and potential critiques. Given the theory and method underlying the JuDJIS measures, they have several attractive properties and advantages over existing methods for measuring judicial ideology in the circuit and district courts.

First, the JuDJIS method can include the entire Article III judiciary on one scale, and it tracks changes over time, running from 1990 to the present. Moreover, by drawing on the experiences of legal experts who have studied the judges’ opinions and interacted with the judges in person over time, the JuDJIS scores capture the more subtle nuances of judging, albeit indirectly, beyond simple votes to affirm or reverse. In this one narrow sense, the technique is similar to Segal and Cover’s (Reference Segal and Cover1989) analysis of media op-eds, which, for nominees with judicial experience, draw on the judge’s previous opinions. But unlike Segal and Cover’s source material, which is locked in before the justice is even confirmed, the JuDJIS evaluations are updated continuously based on the judge’s conduct in that judicial position.

Finally, the JuDJIS scores are sensitive to small true variation between judges and to ideological evolution over time. Indeed, because the evaluators’ scores constitute a sample of a theoretical population of all potential evaluators, the method can estimate standard errors and confidence intervals for each point estimate.Footnote 7

The JuDJIS method is also potentially subject to some critiques, which I attempt to anticipate and address here. First, the evaluations necessarily come from the lawyers most familiar with the judges, as no single set of experts is substantially familiar with all (or even a meaningful fraction of) the federal judiciary. Thus, different mixes of evaluators evaluate different judges. Although I cannot definitively rule out systematic differences in evaluator characteristics, there is no particular reason to expect them, given the Almanac’s objectives and survey methodology. Indeed, JuDJIS has this trait in common with other leading and established social science indicators.Footnote 8 Consider further that federal appellate lawyers are generally cosmopolitan legal actors, with many practicing in several circuits. Appellate lawyers tend to keep abreast of developments in other circuits and the Supreme Court. There is no requirement that the lawyers interviewed for the Almanac be geographically located within the circuit they evaluate—only that they have personal experience and expertise with the judge in question.

A related potential issue is the possibility of systemic conscious or unconscious bias against members of certain demographic groups, based on ethnicity, gender, or age, for example (Sen Reference Sen2014a, Reference Sen2014b). First, I note that this potential issue exists for other existing measures of ideology such as JCS and CBI, albeit in different ways.Footnote 9 Moreover, recent empirical evidence suggests that demographic-driven bias in evaluating judges is less of a concern than some may fear. In a series of conjoint experiments, Ono and Zilis (Reference Ono and Zilis2022) find that, when asked to evaluate the degree of bias they expect judges with different profiles to exhibit in immigration and abortion cases, the aggregated subject pool either does not distinguish between male and female judges, between Black and non-Black judges, or between Asian and non-Asian judges, or else expects that female judges, Black judges, and Asian judges, respectively, are less likely to be biased.Footnote 10 And to the extent people nonetheless make snap judgments when they have access only to a judge’s superficial characteristics, studies in social psychology (e.g., Pettigrew and Tropp Reference Pettigrew and Tropp2013) have repeatedly shown that bias is mitigated or eliminated as the evaluator interacts with the subject, as with the Almanac’s expert evaluators.

Finally, it is likely that a form of ideological drift occurs, in which what is considered, say, “moderate,” differs between 1990 and 2010. This challenge exists for essentially all attempts to measure ideology over time.Footnote 11 For instance, the NOMINATE-derived JCS scores are determined relative to the issues facing Congress, such that the meaning of a moderate judge depends on what constitutes centrist behavior in Congress during the given period. Likewise, the JuDJIS scores must be interpreted as estimating a judge’s position relative to the ideology norms of that period. Thus, in using these scores—as with other ideology measures—researchers should be cautious in interpreting changes over long periods. In contrast, over short periods such as within a given administration, change can be interpreted with a relatively high degree of confidence.

3 Method Validation

To determine how well the qualitative evaluations capture the essence of judicial ideology, I compare the JuDJIS Circuit Ideology with scores from established and recently developed datasets of ideology produced using different methods. Based on three different validation metrics using four different data sets, I find that, in predicting case outcomes, the JuDJIS Circuit Ideology scores outperform the three existing circuit ideology measures by significant margins.

3.1 Data Comparison

First, Figure 3 is a matrix of pairwise scatterplots, comparing the JuDJIS Circuit Ideology scores (over each judge’s full tenure) with the scores for those same judges (where available) based on: the Party-of-the-Appointing-President (PAP); the JCS; and the CBI scores. The matrix also compares each of the scores with every other one. A least-squares regression of the y-axis scores on the x-axis scores is indicated by a gray line, with 90% confidence intervals indicated by the lighter gray band. The r values denote the correlation coefficients. The matrix diagonal (from top-left to bottom-right) of Figure 3 is a set of histograms indicating the respective distributions of each of the dataset values.

Figure 3 Matrix of pairwise scatterplots: JuDJIS vs. PAP vs. JCS vs. CBI measures.

Note: Observations denote judges (averaged over each judge’s Court of Appeals tenure) 1990–2024; red ‘R’ = Republican-appointed; blue ‘D’ = Democratic-appointed. Top-left r values indicate correlation coefficient of two score sets.

Reviewing the matrix of scatterplots and histograms, a few things are apparent. First, the datasets have notably different distributions. JCS is bimodal on a $-1$ to 1 scale, with the majority of scores falling either in the center-left range ($-0.6$ to $-0.4$) or the center-right range (0.4 to 0.6). This distribution is not surprising given JCS’s design, which assigns judges’ ideological values based primarily on Senate voting records; for all of the covered period, Congress has been highly polarized by party, albeit to different degrees (Barber and McCarty Reference Barber and McCarty2016; Poole and Rosenthal Reference Poole and Rosenthal1985, Reference Poole and Rosenthal2000). In contrast, both JuDJIS and CBI are unimodal, with CBI skewed right (many more liberal judges than conservative, and a long right tail). Given that lawyers as a group, and especially recent law grads (who are most likely to serve as clerks), are more liberal than society generally (Bonica, Chilton, and Sen Reference Bonica, Chilton and Sen2016), this distribution is also unsurprising given CBI’s methodology. Finally, the distribution of JuDJIS data appears to be quite close to normal.

Turning to the scatterplots, JuDJIS is positively correlated with each of the three other ideology measures at statistically significant levels. The substantive correlations are each moderately high, with Pearson correlation coefficients of $r = 0.683$ (vs. PAP), $r = 0.639$ (vs. JCS), and $r = 0.603$ (vs. CBI). Again, these moderate levels of correlation are unsurprising, given (as discussed in the Supplementary Material) the measures’ different implicit assumptions about the nature of ideology and the strategies for measuring it. Such moderate correlation levels imply that, while the four measures may all be attempting to measure the same general underlying concept, it is plausible that a study’s choice of ideology measure might sometimes affect the results (cf. Cope, Crabtree, and Fariss Reference Cope, Crabtree and Fariss2020).

Another trait the scatterplots reveal is the degree to which the different measures can distinguish between individual judges. To different degrees, both PAP and JCS place judges in noticeable silos, in which many judges share the same ideology score. That this would occur for PAP is self-evident, as there are only two parties that have nominated judges. JCS’s silos are far more numerous, and they group fewer judges together. But they result from a similar phenomenon: a given political actor’s ideology, sometimes that of the president, is attributed to all judges for whose appointment the actor is responsible. Both CBI and JuDJIS feature very few judges with the same scores, in part because each is based on at least several dozen individual campaign contributions (for CBI) or evaluations (for JuDJIS). As a result, these scores are more sensitive to small differences between any two judges’ latent ideologies.

Examining a few of the judges lying outside the diagonal also illustrates some key differences between JuDJIS and other methods. For instance, consider the two labeled judges in the JuDJIS-JCS scatterplot in Figure 3, the Fifth Circuit’s John Minor Wisdom (1957–99) and the Ninth Circuit’s Richard Tallman (2000–present). Judge Wisdom, an Eisenhower appointee and a liberal Southern Republican from New Orleans, was one of the “Fifth Circuit Four,” a group of judges who significantly expanded civil rights for African-Americans during the 1950s and 60s in the face of strong, sometimes violent, local White opposition (Grinstein Reference Grinstein2020). Taking senior status in 1977, Wisdom was considered among the most progressive judges in the country until his death in 1999. Drawing on the evaluations of the practicing bar, JuDJIS rates Wisdom among the 1% most liberal judges in the data set. JCS, considering the politics of his appointers, rates him as a moderate conservative. Conversely, 1999 Clinton appointee Judge Richard Tallman was a Republican lawyer who was personally recommended by a prominent conservative Washington state supreme court justice and former Ninth Circuit nominee. Clinton agreed to nominate Tallman as part of a political deal in which Washington’s Republican senator agreed to unblock three of Clinton’s preferred nominees (Slotnick Reference Slotnick2006). JuDJIS, in part reflecting Tallman’s 69% conservative voting record in en banc cases, rates him a moderate conservative. JCS, based on the congressional voting record of Washington’s Democratic (Clinton’s party) senator, considers him a moderate liberal.

3.2 Predicting Case Outcomes

Although these correlations produce several interesting insights, gauging validity by observing how well measures correlate takes us only so far. A better test of a measure’s value is how well it predicts behavior (Cope Reference Cope, Epstein, Grendstad, Šadl and Weinshall2024), that is, its predictive validity. I therefore proceed to determine how well the JuDJIS Circuit Ideology data predict case outcomes relative to how well the three major existing circuit-judge ideology measures, respectively, predict those outcomes. To do so, I draw on a new dataset of en banc decisions, comprising all such decisions during the relevant period (1990–2023), covering all numbered circuits and the D.C. Circuit. By way of background, a federal circuit court hears a small number of cases en banc, in which the whole court typically reviews a decision of a circuit panel. The cases tend to be contentious and are more likely to feature issues implicating traditional ideological cleavages. I validate the data using these cases because they span all circuits and relevant years, and they contain a greater proportion of “harder” cases, i.e., those in which a judge’s ideology is more likely to be salient. Note, however, that “hard” is not equivalent to “ideological” in the traditional, political sense. Not all of these cases involve legal issues that traditionally divide liberals and conservatives; many of the en banc courts split over issues with less traditional ideological salience, such as technical or procedural legal questions. On average, however, we should expect stronger relationships between ideology scores and case outcomes for these hard cases as a whole.

Each of the 414 decisions, including dissents and any concurrences, was read, and each judge’s vote was classified as either conservative or liberal. Using these data, I conduct three tests: (1) goodness of fit; (2) a logit regression, comparing the respective normalized correlation coefficients; and (3) a receiver operating characteristic (ROC) curve, comparing the respective measures’ areas under the curve.

Predictive Validation Using En Banc Votes: Goodness of Fit

I first explore the goodness of fit between votes and judge ideology scores for each of the four measures in turn. The Pearson’s product-moment correlation coefficients are: JuDJIS: $r = 0.408$, 95% CI [0.378, 0.437]; PAP: $r = 0.351$, 95% CI [0.319, 0.382]; JCS: $r = 0.323$, 95% CI [0.290, 0.354]; and CBI: $r = 0.264$, 95% CI [0.215, 0.311]. Thus, a judge’s JuDJIS score explains significantly more variation in these data’s judge votes than the equivalent scores of the three existing measures do.
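As an illustration of this test (not the article's replication code), the snippet below computes the Pearson correlation between binary votes and judge scores, with a Fisher-z 95% confidence interval; the variable names are hypothetical.

```python
# Point-biserial/Pearson fit between votes (1 = conservative,
# 0 = liberal) and a measure's judge scores.
import numpy as np
from scipy import stats

def vote_score_fit(votes, scores):
    r, _ = stats.pearsonr(votes, scores)
    z = np.arctanh(r)                          # Fisher transformation
    se = 1.0 / np.sqrt(len(votes) - 3)
    lo, hi = np.tanh([z - 1.96 * se, z + 1.96 * se])
    return r, (lo, hi)
```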

Predictive Validation Using En Banc Votes: Logit Regression

I next estimate a logit model, regressing votes (whether the judge voted for a conservative outcome) on the predictor (the ideology score of the judge, as respectively estimated by JuDJIS, PAP, JCS, and CBI). Table 1 and Figure 4 present the results. The coefficients indicate the marginal effects of a two-standard-deviation increase in conservativeness in each respective ideology measure on the probability of a conservative vote. (The overall incidence of conservative votes in the full data set is approximately 55.9%.) Thus, for example, a judge with a JuDJIS ideology score of 0.41 is about 45.3 percentage points more likely to cast a conservative vote than a judge with a JuDJIS ideology score of $-0.14$. Each of the four scores is associated with votes at highly significant levels (with JuDJIS’s being the most statistically significant). But a two-standard-deviation change in JuDJIS score is associated with a substantially greater change in the probability of a conservative vote than the equivalent change for the other three scores.
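A sketch of this specification, under the assumption of a data frame with one row per judge-vote, might look as follows; the column names are illustrative, and the two-standard-deviation normalization mirrors the note to Table 1.

```python
# Logit of conservative vote on a 2-SD-normalized ideology score,
# assuming hypothetical columns `conservative_vote` (0/1) and `score`.
import statsmodels.api as sm

def fit_vote_logit(df, measure="score"):
    # Normalize so a one-unit change equals a 2-SD change in the score.
    x = (df[measure] - df[measure].mean()) / (2 * df[measure].std())
    res = sm.Logit(df["conservative_vote"], sm.add_constant(x)).fit()
    return res.get_margeff()                   # average marginal effects
```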

Table 1 Logit predictions: Marginal effects of judge score on probability of casting a conservative en banc vote.

Note: Coefficients are normalized to indicate the change in probability associated with a two-standard-deviation change in the given score. p values are in parentheses.

Figure 4 Probability of a conservative vote as a function of judge ideology score, by measure: en banc cases.

Figure 4 graphically illustrates this difference. The four graphs plot, for each vote, the judge’s ideology score on the x-axis. The judge’s votes are plotted on the y-axis, with conservative votes (1) at the top and liberal ones (0) at the bottom. (They are vertically jittered to show density.) For each graph, a logit regression curve shows the relationship between the two variables. Though the correlations are positive and significant in all four cases, the difference in the relationships’ magnitude is evident from the shapes of the respective s-curves.

Predictive Validation Using En Banc Votes: ROC Analysis

To further gauge predictive validity, I next estimate a logistic regression and plot a ROC curve for each of the four measures. Long common in the diagnostic medicine literature and now often used in political research (Imai and Khanna Reference Imai and Khanna2016; Mueller and Rauh Reference Mueller and Rauh2018), an ROC curve is an arguably more-intuitive method for assessing how well a metric accurately classifies observations into binary outcomes. In this case, the predicted variable is whether the judge casts a conservative vote, as described above. The predicting variable is the ideology score of the judge, as respectively estimated by JuDJIS, PAP, JCS, and CBI.Footnote 12 For every judge ideology threshold, the ROC curve plots the true positive rate against the false positive rate (Fischman Reference Fischman2011; Wang Reference Wang2019). The area under the curve (AUC) therefore represents each measure’s relative success at predicting votes (see Hanley and McNeil Reference Hanley and McNeil1982). Specifically, it denotes the probability that the given measure will rank a randomly chosen conservative vote as conservative instead of liberal.

The top panel of Figure 5 displays the ROC curves. The bottom panel gives the AUC values for each of the four measures. JuDJIS achieves an AUC value of 0.736; PAP’s AUC value is 0.672; JCS’s AUC value is 0.675; and CBI’s AUC value is 0.645. A DeLong test indicates that PAP is statistically indistinguishable from JCS ($p = 0.793$) and marginally significantly higher than CBI ($p = 0.115$). But JuDJIS performs significantly better than all three: since JCS is about 17.5 percentage points above a random classification and JuDJIS is 23.6 percentage points above random, JuDJIS’s performance represents an improvement of 34.9% over JCS’s performance. A DeLong test indicates that the difference between the two is highly significant ($p = 6.10 \times 10^{-12}$).
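A minimal version of the AUC comparison can be written with scikit-learn, as sketched below; the input names are assumptions, and the DeLong test is not part of scikit-learn, so it is omitted here.

```python
# ROC/AUC comparison across measures; `votes` is the 0/1 vote vector,
# `scores_by_measure` maps measure names to score vectors.
from sklearn.metrics import roc_auc_score, roc_curve

def compare_measures(votes, scores_by_measure):
    results = {}
    for name, scores in scores_by_measure.items():
        fpr, tpr, _ = roc_curve(votes, scores)   # points for the ROC plot
        results[name] = (roc_auc_score(votes, scores), fpr, tpr)
    return results                               # e.g., AUC per measure
```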

Figure 5 Top: ROC curves comparing four measures’ success at predicting en banc votes; Bottom: Comparative areas under the ROC curve.

Robustness Analysis

Finally, to show these results’ robustness to different data and model specifications, I perform comparable analyses of a larger data set comprising three-judge panel decisions ( $n = 4,482$ ). While these cases disproportionately constitute “easy” cases, in which the members of the panel are unanimous over 95% of the time, I include them because they better reflect the run-of-the-mill decision-making of circuit judges. The results are substantially similar to those produced by the en banc cases, with JuDJIS performing better, albeit not quite as decisively, than the other three measures in each case. (See Figure A6.1 in the Supplementary Material for analysis and results).

Across 21 head-to-head predictive comparisons using different data forms and model specifications, the JuDJIS Circuit Ideology scores predict outcomes more accurately than the competing measure in all 21, with the difference statistically significant in 20 of the 21. Together, this set of validations indicates that the JuDJIS Circuit Ideology data outperform all existing measures of circuit ideology—using a variety of metrics—in predicting how judges decide cases. And although JuDJIS’s accuracy in reflecting change over time cannot be tested against other federal appellate measures (because no others are currently dynamic), we would expect this dynamism to confer additional advantages in predictive power and general validity.

4 Analyses of Federal Circuit Judge Ideology

In what follows, I present the Circuit Ideology scores in more detail.Footnote 13 To produce the Circuit Ideology scores, I applied the process described above for the ideology category to judges who served in an active capacity on the U.S. courts of appeals from 1990 to 2024,Footnote 14 including essentially all judgesFootnote 15 who served in such a capacity at any point between 1990 and 2019. Figure 6 displays summary statistics for the data set, aggregated and disaggregated by party of appointing president and by the judge’s gender.

Figure 6 JuDJIS circuit ideology descriptive statistics.

Note: Circles denote means for the given sample; gray bars denote 95% confidence intervals.

Figure 7 JuDJIS U.S. circuit judge ideologies with 90% confidence intervals, 1990–2024 (Tiers 1 and 2: #1-235).

Note: Includes the 1st and 2nd quartiles of judges, sorted from most liberal ($-1$) to most conservative ($+1$) by average ideology score, averaged over the judge’s Court of Appeals tenure. Blue open circles denote Democratic president appointees; red closed circles denote Republican president appointees.

Figure 8 JuDJIS U.S. circuit judge ideologies with 90% confidence intervals, 1990–2024 (Tiers 3 and 4: #236-470).

Note: Includes the 3rd and 4th quartiles of judges, sorted from most liberal ($-1$) to most conservative ($+1$) by average ideology score, averaged over the judge’s Court of Appeals tenure. Blue open circles denote Democratic president appointees; red closed circles denote Republican president appointees.

At the judge level (i.e., a judge’s scores aggregated over full tenure), the mean ideology score is 0.133, the median score is 0.131, and the standard deviation is 0.344. Thus, the dataset exhibits a clear conservative slant. Figure 6 shows that this appears to stem from two phenomena: (1) Republican presidents have appointed most (57.8%) of the judges in the dataset; and (2) Democratic-appointed judges are more moderate, i.e., the mean ideology score of Republican-appointed judges (0.319) is more conservative than the mean ideology score of Democratic-appointed judges ($-0.123$) is liberal. This observation is consistent with recent findings that modern Republican judicial appointments have placed greater emphasis on ideology relative to Democratic appointments (Copus, Hübert, and Pellaton Reference Copus, Hübert and Pellaton2025).

To further illustrate the variation between judges, Figures 7 and 8 provide the judicial ideology point estimate and 90% confidence intervals for all 470 judges in the JuDJIS Circuit Ideology data set, averaged over each judge’s Court of Appeals tenure. The judges are ordered from most liberal (top-left of Figure 7) to most conservative (bottom-right of Figure 8). As the figures show, the confidence intervals vary considerably between judges. As explained in Section 2 above, confidence in the ideology estimate is a function of two factors: the number of total comments the judge received over his or her tenure and the uniformity of those comments.Footnote 16 Thus, the judges with particularly large confidence intervals are almost always those with just one evaluation in the data set because they either had a very short circuit tenure, left the bench shortly after 1990, or joined the bench shortly before 2024. (For this last group, uncertainty about the estimate will likely decrease as evaluations from 2024-on are incorporated into the data set.) As the figures show, beginning with Judge George Edwards’s (6th Cir., 1963–1995) score of $-0.80$, the 470 ideology scores rise incrementally, ending with Judge Donald Russell’s (4th Cir., 1969–95) score of 0.93.

Nine eventual Supreme Court nominees are included in the data set based on their circuit court tenures.Footnote 17 One particularly notable score is the “0” assigned to Ruth Bader Ginsburg, who, before her 1993 elevation to the Supreme Court, served thirteen years on the D.C. Circuit on appointment by President Carter. Though such a moderate score may surprise some who know the late judge/justice as a liberal champion of women’s and other civil rights, her tenure as a circuit judge was, in fact, considered positively centrist. According to a 1987 empirical study of the D.C. Circuit, Judge Ginsburg was more likely to vote with Republican than Democratic appointees and generally opposed expanding corporate regulation (Lepore Reference Lepore2018). According to a 2018 biography in The New Yorker, UC Santa Barbara history professor Jane Sherron De Hart described Ginsburg’s D.C. Circuit tenure as “something like a decontamination chamber,” in which Ginsburg was “rinsed and scrubbed of the hazard of her thirteen years as an advocate for women’s rights.” By 1993, the article observed, Ginsburg had been “sufficiently depolarized” for nomination to the high court (Lepore Reference Lepore2018). Her centrist JuDJIS score reflects these observations.

5 Conclusion

This article developed and introduced the Jurist-Derived Judicial Ideology Scores, the first dynamic method for systematically estimating the ideologies and other traits of nearly the entire federal judiciary. Derived from tens of thousands of qualitative evaluations, the method estimates the ideology of essentially every Article III U.S. federal judge serving since 1990. Not surprisingly given the quality of the content underlying the scores, JuDJIS ideology data predict case outcomes with significantly greater accuracy than any of the three leading circuit-judge ideology measures.

The analysis above suggests that experts’ observation of judging is a valid method for measuring ideology. It validates the assumption that legal practitioners and other experts have special insight into how judges decide cases, insight that cannot be captured as successfully by political phenomena such as the judicial-appointment process and judges’ own political behavior.

I hope that JuDJIS’s four non-ideology measures, to be introduced in future work, will further demonstrate empirically the multi-dimensional character of the judging process. In addition to shedding light on important questions themselves, I hope that other findings like these will help to further close the theoretical and methodological gaps that still divide scholars studying how judges make decisions.

Acknowledgments

I thank participants at the 2024 meeting of the American Law & Economics Association, 2024 NYU School of Law External Law & Economics Workshop, 2019 University College Dublin Quantitative Text Analysis Dublin (QTA-DUB) Workshop, 2019 Hebrew University Empirical Study of Public Law & Human Rights Workshop, 2019 ETH Zurich Conference on Data Science and Law, 2019 Princeton University Political Economy and Public Law Conference, 2018 Meeting of the American Political Science Association (APSA), University of Michigan Inter-disciplinary Workshop in American Politics, and 2018 Conference on Empirical Legal Studies (CELS). I thank Megan Rosen, editor of the Almanac of the Federal Judiciary, for providing access to the archived Almanac files. I also thank Deborah Beim, Adam Chilton, Michael Gilbert, Joshua Fischman, Richard Hynes, Michael Nelson, Kevin Quinn, Kelly Rader, Kyle Rozema, Megan Stevenson, and Mariah Zeisberg for helpful comments. I especially thank Charles Crabtree and Adam Feldman for many constructive formative conversations. I thank Kevin Breiner, Ruixing Cao, Husnain Choudhry, Eddie Colombo, Danielle Gibbons, Jake Greenberg, Conor Hargen, Jeffrey Horn, Samuel Lin, Joseph Park, Vaghif Salem, Leighton Schnedler, Jacob Smith, and Latrell Williams for outstanding research assistance or website support. Finally, I profusely thank Li Zhang of the UVA Legal Data Lab for his tireless and invaluable assistance with text analysis over the last two years.

Data Availability Statement

Replication code for this article is available at Cope (2025). A preservation copy of the same code and data can also be accessed via Dataverse at https://doi.org/10.7910/DVN/KZ5BJF.

Competing Interest

The author is not aware of any competing interests.

Supplementary Material

The supplementary material for this article can be found at https://doi.org/10.1017/pan.2025.10009.

Author Biography

Kevin Cope is a Professor of Law and Public Policy, and Professor of Politics (by courtesy) at the University of Virginia.

Footnotes

Edited by: Jeff Gill

This article has been updated since its original publication. A notice detailing the changes can be found here: https://doi.org/10.1017/pan.2025.10020

1 As of 2025, the three leading judicial ideology measures—Epstein et al. (Reference Epstein, Martin, Segal and Westerland2007); Martin and Quinn (Reference Martin and Quinn2002); Segal and Cover (Reference Segal and Cover1989)—had collectively been used or cited in nearly 4,000 published courts studies.

2 I call this approach expert-sourced because of its resemblance to crowd-sourcing techniques (see, e.g., Benoit et al. Reference Benoit, Conway, Lauderdale, Laver and Mikhaylov2016), with the crowd in this case comprising, not the general public, but selected experts.

3 For a more detailed discussion of these measures’ underlying theoretical assumptions, strengths, and weaknesses, see Cope (Reference Cope, Epstein, Grendstad, Šadl and Weinshall2024).

4 For circuit ideology, the thresholds are: bigram: 45; trigram: 25; quadgram: 10; quintgram: 4; sexgram: 4; septgram: 4; octogram: 4; novagram: 4.

5 The intercoder reliability scores are as follows: for the initial decision on salience/non-salience (i.e., 99 versus a value on $[-3,3]$), the Krippendorff’s alpha intercoder reliability score for the three coders is 0.84. For the ngrams for which the three coders unanimously agreed on salience (41.2% of ngrams), the Krippendorff’s alpha intercoder reliability score (Landis and Koch Reference Landis and Koch1977) is 0.95 as to the exact value assigned.

6 Note that this process ensures that the reviews of evaluators who give lengthier comments are not given more weight than those who give shorter ones. For example, in the comment above, the method might assign each of the scored ngrams a 0, but the comment would contribute only one 0 (the mean of 0, 0, 0) to the evaluation score, not three 0s.
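A minimal sketch of this comment-level aggregation; the function name and input structure are hypothetical stand-ins for the actual pipeline.

```python
from statistics import mean

def evaluation_score(comments):
    """Average scored-ngram values within each comment first, so a
    comment with many scored ngrams still contributes one value."""
    comment_means = [mean(scores) for scores in comments if scores]
    return mean(comment_means)

# Hypothetical input: a comment with three scored ngrams (0, 0, 0)
# contributes one 0, not three, alongside a second comment scoring 1.5.
print(evaluation_score([[0, 0, 0], [1.5]]))  # 0.75, not 0.375
```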

7 In fact, the ability to quantify uncertainty can reduce bias: where values are estimated with uncertainty, treating the point estimates as precisely determined predictors in a model, rather than as points in a distribution, introduces measurement error.
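A small simulation illustrates the point: regressing an outcome on a noisy point estimate, treated as if it were exact, attenuates the estimated effect relative to the true latent trait. All quantities below are simulated and hypothetical, included only to demonstrate classical measurement-error attenuation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

true_ideology = rng.normal(size=n)                               # latent trait
noisy_estimate = true_ideology + rng.normal(scale=1.0, size=n)   # point estimate with error
outcome = 2.0 * true_ideology + rng.normal(size=n)

# OLS slope of outcome on each predictor: the noisy point estimate
# pulls the coefficient toward zero (attenuation bias).
for name, x in [("true trait", true_ideology), ("point estimate", noisy_estimate)]:
    slope = np.cov(x, outcome)[0, 1] / np.var(x)
    print(f"{name}: slope ~ {slope:.2f}")  # ~2.00 vs. ~1.00
```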

8 For instance, the Segal and Cover scores (Segal and Cover 1989) and the V-Dem initiative (Coppedge et al. 2011) use different sets of evaluators for different observations. V-Dem evaluates democracy and related traits in over 200 countries, with each country's scores generated by several of more than 3,700 country-specific experts (Coppedge et al. 2011).

9 For JCS, the Senate blue-slip process described above allows a senator to influence federal judicial appointments in the senator's home state, meaning that a senator's subjective impression of a candidate's ideology can affect which judges are nominated and confirmed. For instance, a liberal senator might make stereotypical assumptions about the liberal judicial ideology of a potential nominee (who often would not have substantial prior judicial experience) based on the candidate's race (e.g., Black, Latino, or Indigenous) or sex (female), and support her for that reason. That judge would then receive a liberal JCS score even if she turned out to be more moderate, meaning that the senator's demographic-based assumptions would have biased the JCS score leftward. Somewhat similarly for CBI: in hiring clerks, judges might make comparable assumptions about their potential clerks' ideology based partly on the clerks' race, gender, or expressed sexual orientation (cf., e.g., Vick and Cunningham 2018). (While this last phenomenon would involve bias by judges instead of bias toward judges, it could nonetheless bias the scores similarly.)

10 Ono and Zilis's (2022) conjoint experiment found one instance of perceived bias against non-White judges in the full sample: in immigration cases, respondents (driven by Republican respondents) were more likely to believe that a Hispanic judge would be biased toward the immigrant.

11 Bailey (2007) develops a bridging solution that connects the Supreme Court, Congress, and the President over time, but that method is not available here.

12 For dynamic scores, a judge’s score is averaged over his/her tenure.

13 Section 6.4 of the Supplementary Material provides descriptive statistics of a sample of the JuDJIS District Ideology data.

14 A number of judges appointed by Presidents Trump and Biden between 2020 and 2025 do not appear in the data below because they had not yet been evaluated as of this article’s publication.

15 Nine judges served as active judges for brief periods between 1990 and 2019 but are not included in the data set. Six of the nine served short stints on the court of appeals and therefore decided few or no cases: Michael Chertoff (3rd Circuit, commissioned 6/03, resigned to become Secretary of Homeland Security 2/05), John David Kelly (8th Circuit, commissioned 8/98, died 10/98), Susan Bieke Neilson (6th Circuit, commissioned 11/05, died 1/06), Charles Willis Pickering Sr. (5th Circuit, commissioned after recess appointment 1/04, retired in light of failed confirmation 12/04), H. Lee Sarokin (3rd Circuit, commissioned 10/94, retired 7/96), and David Hackett Souter (1st Circuit, commissioned 5/90, elevated 10/90). The other three judges' court-of-appeals tenures ended early in 1990, and those judges therefore did not have robust evaluations published: Jean Galloway Bissell (Federal Circuit, died 2/90), John Joseph Gibbons (3rd Circuit, retired 1/90), and Clarence Thomas (D.C. Circuit, elevated 1/90).

16 In the rare cases where all comment scores are uniform (usually, for judges with very short tenures over the covered period), the standard error and confidence intervals are undefined and therefore missing.

17 See Figure A6.1 in the Supplementary Material for a comparison of those nine judges.

References

Albaugh, Q., Sevenans, J., Soroka, S., and Loewen, P. J. 2013. "The Automated Coding of Policy Agendas: A Dictionary-Based Approach." In 6th Annual Comparative Agendas Conference. Antwerp.
Albaugh, Q., Soroka, S., Joly, J., Loewen, P., Sevenans, J., and Walgrave, S. 2014. "Comparing and Combining Machine Learning and Dictionary-Based Approaches to Topic Coding." In 7th Annual Comparative Agendas Project (CAP) Conference. Konstanz.
Bailey, M. A. 2007. "Comparable Preference Estimates Across Time and Institutions for the Court, Congress, and Presidency." American Journal of Political Science 51 (3): 433–448.
Bailey, M. A. 2017. "Measuring Ideology on the Courts." In Routledge Handbook of Judicial Behavior, edited by Klein, D., and Tyler, T., 62–83. Abingdon: Routledge.
Barber, M., and McCarty, N. 2016. "Causes and Consequences of Polarization." In Political Negotiation: A Handbook, edited by Mansbridge, J., and Martin, C. J., 39–90. Washington, DC: Brookings Institution Press.
Benoit, K., Conway, D., Lauderdale, B. E., Laver, M., and Mikhaylov, S. 2016. "Crowd-Sourced Text Analysis: Reproducible and Agile Production of Political Data." American Political Science Review 110 (2): 278–295.
Bonica, A., Chilton, A. S., Goldin, J., Rozema, K., and Sen, M. 2017. "Measuring Judicial Ideology Using Law Clerk Hiring." American Law and Economics Review 19 (1): 129–161.
Bonica, A., Chilton, A. S., and Sen, M. 2016. "The Political Ideologies of American Lawyers." Journal of Legal Analysis 8 (2): 277–335.
Bonica, A., and Sen, M. 2017. "A Common-Space Scaling of the American Judiciary and Legal Profession." Political Analysis 25 (1): 114–121.
Bonica, A., and Sen, M. 2021. "Estimating Judicial Ideology." Journal of Economic Perspectives 35 (1): 97–118.
Boyd, C. L. 2011. Federal District Court Judge Ideology Data.
Cingranelli, D. L., and Richards, D. L. 2010. "The Cingranelli and Richards (CIRI) Human Rights Data Project." Human Rights Quarterly 32 (2): 401–424.
Converse, P. E. 2006. "The Nature of Belief Systems in Mass Publics (1964)." Critical Review 18 (1–3): 1–74.
Cope, K. L. 2023. "Measuring Law's Normative Force." Journal of Empirical Legal Studies 20 (4): 1005–1044.
Cope, K. L. 2024. "The Conceptual Challenge to Measuring Ideology." In The Oxford Handbook of Comparative Judicial Behaviour, edited by Epstein, L., Grendstad, G., Šadl, U., and Weinshall, K., 895–916. Oxford: Oxford University Press.
Cope, K. L. 2025. Replication Data for: "An Expert-Sourced Measure of Judicial Ideology." Harvard Dataverse. https://doi.org/10.7910/DVN/KZ5BJF.
Cope, K. L., Crabtree, C., and Fariss, C. J. 2020. "Patterns of Disagreement in Indicators of State Repression." Political Science Research and Methods 8 (1): 178–187.
Coppedge, M., et al. 2011. "Conceptualizing and Measuring Democracy: A New Approach." Perspectives on Politics 9 (2): 247–267.
Copus, R., Hübert, R., and Pellaton, P. 2025. "Trading Diversity? Judicial Diversity and Case Outcomes in Federal Courts." American Political Science Review 119 (2): 832–846.
Epstein, L., Martin, A. D., and Quinn, K. 2024. "Measuring Political Preferences." In The Oxford Handbook of Comparative Judicial Behaviour, edited by Epstein, L., Grendstad, G., Šadl, U., and Weinshall, K., 325–344. Oxford: Oxford University Press.
Epstein, L., Martin, A. D., Segal, J. A., and Westerland, C. 2007. "The Judicial Common Space." The Journal of Law, Economics, & Organization 23 (2): 303–325.
Epstein, L., Walker, T. G., and Dixon, W. J. 1989. "The Supreme Court and Criminal Justice Disputes: A Neo-Institutional Perspective." American Journal of Political Science 33 (4): 825–841.
Farah, H. A., and Kakisim, A. G. 2023. "Enhancing Lexicon Based Sentiment Analysis Using n-gram Approach." In Smart Applications with Advanced Machine Learning and Human-Centred Problem Design, edited by Xhafa, F., 213–221. Cham: Springer.
Fischman, J. B. 2011. "Estimating Preferences of Circuit Judges: A Model of Consensus Voting." The Journal of Law and Economics 54 (4): 781–809.
Fischman, J. B., and Law, D. S. 2009. "What Is Judicial Ideology, and How Should We Measure It?" Washington University Journal of Law & Policy 29: 133.
Gaudet, H., and St. John, C. W. 1933. "Individual Differences in the Sentencing Tendencies of Judges." The Journal of Criminal Law and Criminology 23: 811.
Gerring, J. 1997. "Ideology: A Definitional Analysis." Political Research Quarterly 50 (4): 957–994.
Giles, M. W., Hettinger, V. A., and Peppers, T. 2001. "Picking Federal Judges: A Note on Policy and Partisan Selection Agendas." Political Research Quarterly 54 (3): 623–641.
Grendstad, G., Shaffer, W. R., and Waltenburg, E. N. 2012. "Ideologi og Grunnholdninger hos Dommerne i Norges Høyesterett" ["Ideology and Basic Attitudes of the Justices of Norway's Supreme Court"]. Lov og Rett 51 (4): 240–253.
Grimmer, J., and Stewart, B. M. 2013. "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts." Political Analysis 21 (3): 267–297.
Grinstein, M. 2020. "The Fifth Circuit Four." The History Teacher 54 (1): 155–179.
Gwartney, J., Lawson, R., Hall, J., and Murphy, R. 2021. Economic Freedom of the World: 2021 Annual Report. Fraser Institute.
Hanley, J. A., and McNeil, B. J. 1982. "The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve." Radiology 143 (1): 29–36.
Harcourt, B. E. 2007. "Judge Richard Posner on Civil Liberties: Pragmatic Authoritarian Libertarian." The University of Chicago Law Review 74: 1723.
Imai, K., and Khanna, K. 2016. "Improving Ecological Inference by Predicting Individual Ethnicity from Voter Registration Records." Political Analysis 24 (2): 263–272.
Jolly, S., et al. 2022. "Chapel Hill Expert Survey Trend File, 1999–2019." Electoral Studies 75: 102420.
Lammon, B. D. 2009. "What We Talk About When We Talk About Ideology: Judicial Politics Scholarship and Naive Legal Realism." St. John's Law Review 83: 231.
Landis, J. R., and Koch, G. G. 1977. "The Measurement of Observer Agreement for Categorical Data." Biometrics 33 (1): 159–174.
Lepore, J. 2018. "Ruth Bader Ginsburg's Unlikely Path to the Supreme Court." The New Yorker, 34–41.
Martin, A. D., and Quinn, K. M. 2002. "Dynamic Ideal Point Estimation via Markov Chain Monte Carlo for the U.S. Supreme Court, 1953–1999." Political Analysis 10 (2): 134–153.
McMillion, B. J. 2017. The Blue Slip Process for U.S. Circuit and District Court Nominations: Frequently Asked Questions. Washington, DC: Congressional Research Service.
Mueller, H., and Rauh, C. 2018. "Reading Between the Lines: Prediction of Political Violence Using Newspaper Text." American Political Science Review 112 (2): 358–375.
Nagel, S. S. 1961. "Political Party Affiliation and Judges' Decisions." American Political Science Review 55 (4): 843–850.
Ono, Y., and Zilis, M. A. 2022. "Ascriptive Characteristics and Perceptions of Impropriety in the Rule of Law: Race, Gender, and Public Assessments of Whether Judges Can Be Impartial." American Journal of Political Science 66 (1): 43–58.
Pettigrew, T. F., and Tropp, L. R. 2013. "Does Intergroup Contact Reduce Prejudice? Recent Meta-Analytic Findings." In Reducing Prejudice and Discrimination, edited by Oskamp, S., 93–114. Mahwah, NJ: Lawrence Erlbaum Associates.
Poole, K. T., and Rosenthal, H. 1985. "A Spatial Model for Legislative Roll Call Analysis." American Journal of Political Science 29 (2): 357–384.
Poole, K. T., and Rosenthal, H. 2000. Congress: A Political-Economic History of Roll Call Voting. Oxford: Oxford University Press.
Rohde, D. W., and Spaeth, H. J. 1976. Supreme Court Decision Making. San Francisco: W. H. Freeman.
Rosen, M., ed. 2023. Almanac of the Federal Judiciary: Profiles and Evaluations of All Judges of the United States Circuit Courts and the United States Supreme Court. New York: Wolters Kluwer.
Schubert, G. A. 1960. Quantitative Analysis of Judicial Behavior. Glencoe, IL: Free Press.
Segal, J. A., and Cover, A. D. 1989. "Ideological Values and the Votes of U.S. Supreme Court Justices." American Political Science Review 83 (2): 557–565.
Sen, M. 2014a. "How Judicial Qualification Ratings May Disadvantage Minority and Female Candidates." Journal of Law and Courts 2 (1): 33–65.
Sen, M. 2014b. "Minority Judicial Candidates Have Changed: The ABA Ratings Gap Has Not." Judicature 98: 46.
Slotnick, E. E. 2006. "Appellate Judicial Selection During the Bush Administration: Business as Usual or a Nuclear Winter?" Arizona Law Review 48: 225.
Spaeth, H., Epstein, L., Ruger, T., Whittington, K., Segal, J., and Martin, A. D. 2014. "Supreme Court Database Code Book." http://scdb.wustl.edu.
Vick, A. D., and Cunningham, G. 2018. "Bias Against Latina and African American Women Job Applicants: A Field Experiment." Sport, Business and Management: An International Journal 8 (4): 410–430.
Voeten, E. 2007. "The Politics of International Judicial Appointments: Evidence from the European Court of Human Rights." International Organization 61 (4): 669–701.
Wang, Y. 2019. "Comparing Random Forest with Logistic Regression for Predicting Class-Imbalanced Civil War Onset Data: A Comment." Political Analysis 27 (1): 107–110.
Wijtvliet, W., and Dyevre, A. 2021. "Judicial Ideology in Economic Cases: Evidence from the General Court of the European Union." European Union Politics 22 (1): 25–45.
Windett, J. H., Harden, J. J., and Hall, M. E. 2015. "Estimating Dynamic Ideal Points for State Supreme Courts." Political Analysis 23 (3): 461–469.

Figure 1 Structure of document components (evaluation-comment-ngram).


Figure 2 Distribution of JuDJIS ideology comment scores for a sample of eight circuit judges, aggregated over their tenures. Note: Vertical dashed lines denote the mean score. Gray bands denote 90% confidence intervals around the mean score.


Figure 3 Matrix of pairwise scatterplots: JuDJIS vs. PAP vs. JCS vs. CBI measures. Note: Observations denote judges (averaged over each judge's Court of Appeals tenure), 1990–2024; red 'R' = Republican-appointed; blue 'D' = Democratic-appointed. Top-left r values indicate the correlation coefficient between the two score sets.


Table 1 Logit predictions: Marginal effects of judge score on probability of casting a conservative en banc vote.
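For readers estimating analogous quantities, a hedged sketch of a logit fit with average marginal effects using statsmodels; the simulated data frame and variable names below are illustrative stand-ins, not the article's replication code.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical en banc vote data: 1 = conservative vote, plus a judge
# ideology score in [-1, 1].
rng = np.random.default_rng(1)
df = pd.DataFrame({"score": rng.uniform(-1, 1, 500)})
df["conservative_vote"] = rng.binomial(1, 1 / (1 + np.exp(-2 * df["score"])))

X = sm.add_constant(df[["score"]])
fit = sm.Logit(df["conservative_vote"], X).fit(disp=0)

# Average marginal effect of the ideology score on Pr(conservative vote),
# analogous to the quantities reported in Table 1.
print(fit.get_margeff(at="overall").summary())
```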


Figure 4 Probability of a conservative vote as a function of judge ideology score, by measure: en banc cases.


Figure 5 Top: ROC curves comparing four measures’ success at predicting en banc votes; Bottom: Comparative areas under the ROC curve.
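A minimal sketch of the AUC comparison underlying such a figure, using scikit-learn's `roc_auc_score`; the vote and prediction arrays here are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical validation arrays: actual en banc votes (1 = conservative)
# and each measure's predicted probability of a conservative vote.
votes = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predicted = {
    "JuDJIS": np.array([0.9, 0.2, 0.7, 0.8, 0.3, 0.4, 0.6, 0.1]),
    "JCS":    np.array([0.6, 0.4, 0.5, 0.7, 0.5, 0.3, 0.5, 0.4]),
}

# Area under the ROC curve summarizes each measure's classification
# success across all thresholds (cf. Hanley and McNeil 1982).
for name, p in predicted.items():
    print(name, round(roc_auc_score(votes, p), 2))
```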


Figure 6 JuDJIS circuit ideology descriptive statistics. Note: Circles denote means for the given sample; gray bars denote 95% confidence intervals.


Figure 7 JuDJIS U.S. circuit judge ideologies with 90% confidence intervals, 1990–2024 (Tiers 1 and 2: #1–235). Note: Includes the 1st and 2nd quartiles of judges, sorted from most liberal (−1) to most conservative (+1) by average ideology score, averaged over the judge's Court of Appeals tenure. Blue open circles denote Democratic president appointees; red closed circles denote Republican president appointees.


Figure 8 JuDJIS U.S. circuit judge ideologies with 90% confidence intervals, 1990–2024 (Tiers 3 and 4: #236–470). Note: Includes the 3rd and 4th quartiles of judges, sorted from most liberal (−1) to most conservative (+1) by average ideology score, averaged over the judge's Court of Appeals tenure. Blue open circles denote Democratic president appointees; red closed circles denote Republican president appointees.
