
Balancing as a Means of Judicial Activism? Analysis of the German Federal Constitutional Court’s Use of Balancing Language

Published online by Cambridge University Press:  16 May 2025

Kilian Lüders*
Affiliation:
Faculty of Law, Humboldt-Universität zu Berlin, Berlin, Germany

Abstract

Many constitutional courts use balancing in constitutional rights adjudication. However, critics argue that balancing is a form of (self-)empowerment of the courts and a tool of judicial activism. It is claimed that constitutional courts are increasingly using this technique when ruling against the legislature, for example when striking down laws. This study empirically examines the status of balancing in the case law of the German Federal Constitutional Court. It demonstrates that text-as-data methods can be used to analyze judicial reasoning by using word embeddings to measure the use of balancing language. It is shown that the use of balancing language increased during the first fifty years of the court’s existence. Since then, there has been a decline. The court also tends not to use more balancing language in decisions overturning laws. This evidence challenges the critique’s assumption that balancing is a tool of judicial activism.

Type
Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press on behalf of the German Law Journal

A. Introduction

Throughout the twentieth century, it became an established practice for many constitutional courts to deploy balancing in constitutional rights adjudication. In particular, the German Federal Constitutional Court (“GFCC”) stands out as a pioneer of the proportionality test.Footnote 1 Given that several other constitutional courts have adopted the proportionality test, the GFCC is considered particularly influential.Footnote 2 While some authors enthusiastically describe this development as a global spread of fundamental rights protection,Footnote 3 others are rather critical of it. They argue that balancing is a technique used for judicial activism.Footnote 4 According to this view, balancing allows constitutional courts too much leeway in justifying their decisions, leading to judicial self-empowerment. As a result, these authors argue, constitutional courts would resort to balancing when they want to assert themselves against the legislature.

It is important to note that balancing is not a typical method employed by courts in their decisions. Judicial reasoning is usually categorical and proceeds by subsumption under abstract definitions, for example when stating whether a contract is valid or whether a planning application has been wrongly refused. Balancing is therefore an exceptional form of judicial reasoning. While the use of balancing by courts has received a great deal of attention in the academic literature, there is currently very little empirical evidence on how courts utilize this tool.Footnote 5 What remains insufficiently substantiated in this context is the accusation that constitutional courts are increasingly using balancing as a tool for judicial activism. One particularly powerful form of judicial activism occurs when the court opposes the legislature by overturning a law. But is it empirically true that constitutional courts use balancing more often when they strike down legislation?

This Article seeks to address these questions by empirically examining the use of balancing language in the GFCC jurisprudence. Balancing language refers to the typical phrases and language used by the court when balancing.Footnote 6 To empirically measure balancing language, state-of-the-art text-as-data methods are used to automatically analyze 3,274 decisions. This involves creating a word embedding model that is used to scale balancing language. The use of this method for the study of judicial argumentation techniques is new, which is why a particularly detailed validation is carried out.

Thus, it can be shown empirically that the use of balancing language by the GFCC increased during the first fifty years of the Court’s existence. Since then, however, it has tended to decline. Also, the hypothesis that more balancing language is used in decisions in which the court strikes down legislation cannot be confirmed.

This empirical study contributes to the literature in several ways: It adds new empirical insights to the debate on the use of balancing in constitutional rights adjudication and the critique of judicial activism. At the same time, it also contributes to the study of judicial behavior by examining the strategic use of argumentative means by courts in their decisions. Finally, this Article also adds to the methodological text-as-data literature insofar as methods are applied to new text genres and research questions. This is one of the first attempts to use scaling methods to study argumentation techniques in case law.

The remainder of the Article is organized as follows: First, Section B introduces the debate on balancing and the hypotheses. Second, Section C presents the data and methodology for measuring balancing language. Third, Section D provides an extensive validation of the newly introduced measurement. Finally, Section E tests the hypotheses based on the newly created data.

B. Balancing

Empirical research on judicial behavior and separation of powers has made a significant contribution to explaining the strategic behavior of constitutional courts.Footnote 7 Even for the GFCC, a growing number of studies show how the court behaves as a strategic, and at times activist, actor.Footnote 8 Despite the variety of research approaches,Footnote 9 the empirical study of the judicial tools used by courts is still in its infancy.Footnote 10 There is a gap in the research on how certain techniques of constitutional reasoning are used by the courts.Footnote 11 One such argumentative technique that has received considerable attention is balancing. This Article attempts to empirically examine the strategic use of balancing as a specific constitutional argumentative technique that has been the subject of controversy in legal scholarship for some time. The conceptual framework of the study will be presented in three steps. First, the current debate and criticism of constitutional balancing will be summarized, and the hypotheses derived in Section B.I. Second, the empirical research on how the GFCC uses balancing will be discussed in Section B.II. Third, the object of the study will be specified by clarifying what is meant by balancing language in Section B.III.

I. Theory and Hypotheses

The debate on balancing in law unfolded over the course of the twentieth century,Footnote 12 and the age of constitutional balancing has been diagnosed for more than thirty years now.Footnote 13 The German Federal Constitutional Court, as the inventor of the proportionality test, receives particular attention in this context.Footnote 14 The proportionality test is a four-step review process originally introduced to control state interference with individual liberties: The legitimate aim, the suitability as well as the necessity of the chosen means and, finally, the appropriateness are examined, with the final step being a balancing process.Footnote 15 The proportionality test was adopted—albeit in modified form—by other constitutional courts, which was the starting point for a debate on the globalization of constitutional law.Footnote 16 Today, the principle of proportionality is often taken for granted,Footnote 17 as an indispensable tool in the review of constitutional rights or even as a basic element of the rule of law.Footnote 18

Despite its established status, proportionality has always been critically commented on, especially because of balancingFootnote 19 and especially in regard to the GFCC.Footnote 20 It is argued that balancing is not a rational legal argumentation technique because there can be no fixed legal standards.Footnote 21 Weighing is always necessary and should be done by the democratically legitimated legislature and not by the courts.Footnote 22 Balancing thus opens up certain freedoms for the courts and invites judicial activism. In this sense, the criticism of the GFCC is quite specifically that it uses balancing or the proportionality test to give itself more room for maneuver in decision-making vis-à-vis politics.Footnote 23

This critical view is not shared by everyone in the literature. Many see proportionality as just another legal principle of German public lawFootnote 24 that enables the mediation of individual rights and public interests.Footnote 25 It is argued that proportionality is also equipped with nuanced internal standards. Depending on the fundamental right in question and the characteristics of the case, a “rich casuistry” facilitates differentiated standards of review.Footnote 26 The argumentative approaches can differ considerably. For example, cases involving the principle of equality found in Art. 3(1) of the Basic Law can be reviewed as a mere check on arbitrariness, but can also lead to a detailed four-step review.Footnote 27 The proportionality test and balancing are therefore not generally viewed with suspicion in the German public law discourse. They are often seen as an obvious approach, which does not necessarily imply a strategic motive.

The debate on the use of balancing by the GFCC is thus quite controversial. Many aspects of the debate involve very technical legal issues and raise questions about how the court should argue. However, the position of the critics who assume a link between the judicial decision-making technique and judicial activism can be understood as an empirical claim that can be examined:Footnote 28 Do courts actually use certain argumentation techniques strategically when they want to realize power vis-à-vis the legislature? Is there more balancing in such confrontational decisions?

Empirical research can contribute to this debate by shedding light on the actual use of balancing. Of course, this can only be one aspect of the broader debate on judicial activism. Judicial activism can take many different forms. In addition to balancing, there may be other strategies for the court to be activist, and various forms of judicial discretion can be distinguished. This raises a whole range of potential empirical research questions. However, this Article will focus on balancing, as this is an issue of particular interest in the global debate, especially with regard to the GFCC.

It is important to note that balancing goes beyond the proportionality test. Balancing is not only found in structured schemes, such as the proportionality test but is also often used freely, in an unstructured form, by the Court.Footnote 29 Thus, all explicit mentions of balancing methods, as well as all more implicit, weighting formulations and uses of language, should be included in an examination of the strategic use of balancing as a tool of judicial activism.Footnote 30

When it comes to judicial activism, or more fundamentally measuring the Court’s decision-making, the empirical literature on the GFCC often focuses on the Court’s power to overturn legislation.Footnote 31 In these decisions, judicial activism is particularly evident because the court is ruling against a legislative body that has been legitimated by democratic elections. It is, therefore, a direct demonstration of power against the legislatureFootnote 32 to which the GFCC is legally entitled under § 31 of BVerfGG, the German Constitutional Court Act. This study builds on the empirical literature and considers decisions overturning laws and/or declaring them unconstitutional as an expression of judicial activism. This understanding of judicial activism, combined with the critique of balancing as a tool of judicial activism, leads to the first hypothesis:

H1: In decisions overturning laws, the Court uses more balancing language.

The next hypothesis concerns the development over time. Critics often accuse the court of increasingly using balancing.Footnote 33 Two different but compatible theories support this claim. The first theory is concerned with the influence of the institutional strength and position of the court in the political environment on its decision-making.Footnote 34 In the early Federal Republic, the GFCC was a new institution that first had to attain its position of power.Footnote 35 Accordingly, it is to be expected that in the majority of decisions, the Court initially adopted a more cautious approach in its reasoning, which also applies in particular to newer techniques such as the proportionality test.

The second line of reasoning is based on the practice of fundamental rights interpretation. As the GFCC has gradually established a very broad and far-reaching protection of fundamental rights, ultimately protecting any form of human action,Footnote 36 the need to deal with conflicts of fundamental rights has increased over time.Footnote 37 It makes sense to use balancing to resolve such conflicts, as none of the conflicting fundamental rights has to be restricted, but a case-specific compromise can be found.Footnote 38 The dynamics of fundamental rights jurisprudence thus also speak in favor of an increase in the use of balancing language. Both strands of argumentation on temporal dynamics lead to the second hypothesis:

H2: The Court increasingly uses balancing language over time.

Both hypotheses, thus, allow us to empirically test important assumptions of the critique of the GFCC and shed light on how certain legal argumentation techniques are used by the court.

II. Empirical Research

Although research on balancing and proportionality is extensive, empirical research is rather limited.Footnote 39 In the following, the main empirical contributions on the use of balancing by the GFCC are discussed.

The most important contribution is undoubtedly the comparative study by Petersen.Footnote 40 It shares with the present study the guiding question of whether constitutional courts use balancing to promote judicial activism.Footnote 41 Methodologically, an argumentation analysis is conducted for constitutional courts in Germany, Canada, and South Africa. For the analysis, Petersen has created a category system of argumentation types, which also includes the steps of the proportionality test.Footnote 42 The object of Petersen’s study on the GFCC is the set of decisions in which laws have been declared invalid.Footnote 43 For the GFCC, Petersen observes great variability over time and concludes that balancing is the dominant, but not the only, decisive argumentation technique.Footnote 44 However, at no time was it the case that the majority of cases examined were decided on the basis of balancing. Rather, a tendency of the Court to limit itself in balancing can be observed, which is why its use is far less problematic than critics claim.Footnote 45 Petersen’s study is easily one of the best empirical works on the GFCC. However, the selection of cases (n=241) leads to limitations. Petersen can only make statements on those cases in which laws are rejected.Footnote 46 It is, therefore, not possible to make an empirical statement on the basis of Petersen’s data as to whether there is more balancing in decisions dismissing legislation than in others.

A second empirical study of the GFCC was published by Lang.Footnote 47 The main interest of the work is the contexts in which the proportionality test is applied. Based on a pre-selection through a keyword search, 114 decisions between 2000 and 2017 were analyzed manually according to a system of categories.Footnote 48 Again, there are limitations in the case selection, which is restricted to decisions with typical keywords of the proportionality test.

In a third noteworthy study, again on the GFCC, Stohlmann and his colleagues randomly selected 300 decisions and annotated the proportionality test at the sentence level.Footnote 49 This resulted in very fine-grained data on the occurrence and internal structure of the test.Footnote 50 The Article aims to describe how the proportionality test and its application have evolved. However, this approach is not concerned with balancing as such, only as a part of the proportionality test.

Apart from research on the GFCC, it can currently be observed that more and more innovative research designs are devoted to balancing in court decisions. For example, Steiner and her colleaguesFootnote 51 use an experiment to investigate the relevance of balancing in the proportionality test. Zufall and her colleaguesFootnote 52 design a formal model to capture balancing between selected fundamental rights of the EU Charter of Fundamental Rights (“EUCh”).

The present study also relies on novel research designs by using methodological text-as-data innovations. They allow the automated analysis of the entire body of GFCC case law and, thus, comparisons between decisions rejecting and not rejecting legislation.

III. Balancing Language

It is often disputed what exactly is meant by balancing.Footnote 53 The literature on the GFCC, therefore, often focuses on the explicit proportionality test,Footnote 54 without taking into account that the Court also balances outside the proportionality test. In contrast, this study, whose object is the balancing language of the GFCC, adopts the broadest conceivable understanding of balancing. It understands balancing as an argumentative technique that can be recognized by certain legal language, typical phrases and formulations.Footnote 55 What matters here is the language used by the court, not the presence of a full proportionality test.

In some cases, the court also explicitly states that it is “balancing” or “weighting.” Thus, in the case law one can find formulations such as: “When weighing up the public interest which the provision serves against the severity of the encroachment, the provision is found to be (only just) reasonable in general.”Footnote 56 Such explicit references also include other key terms, such as the naming of the Verhältnismäßigkeitsprüfung (proportionality test) or its balancing step Angemessenheit (appropriateness).

However, it cannot be assumed that the GFCC always names and makes explicit its own methodological approach. Balancing or weighting is often carried out without being named as such. In this sense, the measurement also needs to find expressions of the execution of balancing. Such formulations could be as follows: “The more concretely and directly a legal interest is placed in danger by an expression of opinion, the less stringent are the requirements when it comes to an encroachment; the more indirect and distant the threatening violations of legal interests remain, the greater are the requirements to be made.”Footnote 57 Accordingly, the language of balancing also includes relational formulations, which can be identified by keywords such as umso, desto (the more), or größer (greater).

This operationalization of balancing, thus, targets both explicit naming and relational execution. At the same time, certain linguistic expressions and forms of argumentation, such as categorical or deductive ones, are excluded. The aim of the next section is to introduce a heuristic with which balancing language can be measured in all senate decisions of the GFCC. A new dataset is, therefore, created to test the hypotheses, whereby the evaluation of the entire case law allows comparisons between different groups of cases and is thus able to close the identified research gap.

C. Methods: Scaling Balancing Language

Scaling text means automatically locating the text on a meaningful scale.Footnote 58 Pioneers in the use of these techniques for empirical research include political science, which has developed procedures to position political texts on ideological left-right scales.Footnote 59 A comparatively new heuristic is the use of word embeddings for these purposes. Word embeddings are used not only in political science for political right-left scales,Footnote 60 but also for sentiments,Footnote 61 or in sociology to capture stereotyping.Footnote 62

Text-as-data methods are now increasingly used for legal textsFootnote 63 and tasks, such as locating decisions on political ideology scales.Footnote 64 Some work is already using word embeddings for more innovative decision scaling, such as Dyevre,Footnote 65 who locates GFCC decisions on a pro/anti-EU scale, or Ash et al.,Footnote 66 who measure judicial sentiments. In general, however, it is quite difficult to scale judicial decisions, which is why questionable proxies are often used.Footnote 67

This Article aims to scale the intensity of the use of a legal reasoning technique. It is an attempt to investigate latent semantic scaling procedures for a new research object, legal argumentation. After introducing the data and the main independent variable in Section C.I., the functioning of the word embeddings will be explained and the model specifically created for the GFCC will be presented in Section C.II. Finally, the scaling procedure will be introduced in Section C.III.

I. Data

The project uses the GFCC full-text corpus.Footnote 68 The research questions of the Article focus on the decisions of the senates, excluding chamber decisions. Accordingly, all senate decisions of the Court between 1951 and 2021 that have been published in the official collection and contain substantial reasoning have been selected. However, ninety very atypical decisions were not taken into account. Decisions whose reasoning comprises fewer than forty or more than 10,000 tokens are excluded. On this basis, 3,274 decisions are analyzed. The hypotheses refer to the legal reasoning of the court, so the presentation of the facts in the decisions, including the statements by the parties, is not taken into account for any of the variables used in this Article. As pre-processing, the text was segmented into tokens using spaCy, a software library for natural language processing, and stop words were removed.Footnote 69 The corpus comes with a meta-datasetFootnote 70 that provides information on the date of the decision, the senate, and the type of proceeding.
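The length filter and stop word removal described above can be sketched in a few lines. This is only an illustration under stated assumptions: whitespace splitting stands in for the spaCy tokenizer, the stop word set is a tiny hypothetical sample, and the filter is assumed to apply to the token count of the reasoning before stop words are removed, which the text does not specify.

```python
MIN_TOKENS = 40       # lower bound named in the text
MAX_TOKENS = 10_000   # upper bound named in the text

# Tiny illustrative stop word set (hypothetical); a real list is much longer.
STOP_WORDS = {"der", "die", "das", "und", "ist"}


def tokenize(text):
    """Whitespace tokenization as a simple stand-in for a real tokenizer."""
    return text.split()


def keep_decision(reasoning_tokens):
    """Keep a decision only if its reasoning has 40 to 10,000 tokens."""
    return MIN_TOKENS <= len(reasoning_tokens) <= MAX_TOKENS


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop word set."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = tokenize("Die Abwägung ist geboten")
print(remove_stop_words(tokens))
```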

However, the independent variable that captures whether or not a law is overturned in a decision had to be created. The unconstitutionality of a law is determined by the GFCC in the introductory statement. In doing so, the court uses standardized language. It writes that a law is unvereinbar (incompatible) with the Basic Law and is, therefore, nichtig (void). Based on these typical expressions, regex rules were created to search for the relevant decision passages.Footnote 71 Accordingly, in 559 of the total 3,274 decisions, laws were at least partially overturned.
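A rule of this kind might look roughly like the following sketch. The pattern is hypothetical and far simpler than whatever rules the study actually used (those are documented in its Footnote 71, which is not reproduced here); it only flags the two standardized terms named in the text, and the function name `overturns_law` is invented for illustration.

```python
import re

# Hypothetical pattern: matches the standardized terms "unvereinbar"
# (incompatible) or "nichtig" (void) as whole words.
OVERTURN_RE = re.compile(r"\b(unvereinbar|nichtig)\b", re.IGNORECASE)


def overturns_law(operative_part):
    """Return True if the operative part contains one of the key terms."""
    return OVERTURN_RE.search(operative_part) is not None


examples = [
    "§ 5 des Gesetzes ist mit Artikel 3 Absatz 1 des Grundgesetzes unvereinbar und nichtig.",
    "Die Regelung ist mit dem Grundgesetz vereinbar.",
    "Die Verfassungsbeschwerde wird zurückgewiesen.",
]
for sentence in examples:
    print(overturns_law(sentence), sentence)
```

Because the pattern requires the full word, "vereinbar" (compatible) in the second example is not matched. A production rule would still need to handle negations and restrict the search to the operative part of the decision.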

II. Word Embeddings

Measuring balancing language, the dependent variable, is more difficult because the language used by the court is more diverse. Word embeddings offer a solution to this problem. They are a technique that represents words as vectors, which means locating them in a high-dimensional vector space.Footnote 72 Each word is represented as a vector, with the geometric relationship of the vectors to each other representing semantic meaning.Footnote 73 For example, words with similar meanings are closer together in vector space. The information that can be incorporated into word embedding models can be quite rich, even allowing analogy inference.Footnote 74

For empirical social science, it is particularly relevant that word embeddings can be used to capture latent constructs, such as political ideology.Footnote 75 In practice, this is done on the basis of so-called seed words, terms that ideally represent the language use of interest. A new vector representing the abstract construct can then be calculated from the vectors of the seed words. This new vector can be used to measure latent dimensions of meaning, such as gender stereotypes, political ideologies or balancing language.Footnote 76 Figure 1 schematically illustrates how a vector is created from the vectors of the seed words intensiv (intensive), desto (the more), Verhältnismäßigkeitsprinzip (principle of proportionality), and Abwägung (balancing), which is supposed to represent abstract balancing language.

Figure 1. Conceptual plot of different word vectors. A new vector, an abstract representation of balancing, is created from four seed word vectors: Intensiv (intensive), desto (the more), Verhältnismäßigkeitsprinzip (principle of proportionality), and Abwägung (balancing).
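The construction of such a vector can be illustrated with a minimal sketch. It assumes that the balancing vector is the plain component-wise mean of the seed word vectors, which the text does not state explicitly, and it uses invented three-dimensional toy vectors rather than the 200-dimensional vectors of the actual model.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equally long vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


# Toy embeddings with invented numbers, for illustration only.
toy_embeddings = {
    "intensiv": [0.9, 0.1, 0.0],
    "desto": [0.7, 0.3, 0.0],
    "Verhältnismäßigkeitsprinzip": [0.8, 0.0, 0.2],
    "Abwägung": [1.0, 0.0, 0.2],
}

seed_words = ["intensiv", "desto", "Verhältnismäßigkeitsprinzip", "Abwägung"]

# Assumption: the abstract balancing vector is the mean of the seed vectors.
balancing_vector = mean_vector([toy_embeddings[w] for w in seed_words])
print(balancing_vector)
```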

The quality of this approach depends largely on the suitability of the word representations, that is, the language model used. As this project aims to scale legal texts, the word embeddings model must also be able to represent legal language correctly.Footnote 77 To this end, a custom model was created based solely on GFCC jurisprudence to represent as accurately as possible the linguistic practice of this particular court.

To create a word embedding model, a neural network is used, which is given the task of predicting the correct word from the context of its use.Footnote 78 To do this, arbitrary sentences are taken from the GFCC case law, in which individual words are masked. The neural network is then instructed to infer these words from the context and learn from its incorrect predictions. During this training, which is also repeated for all words, the model optimizes its predictions and obtains a 200-dimensional vector representation for each word. The GFCC model was trained using the implementation of the software library Gensim.Footnote 79

A simple face validity check can be used to verify that the training was successful. This is done by listing the most similar vectors of a key term and checking whether there is also a semantic relationship in the resulting terms. Similarity in this context means the geometric cosine similarity of two word vectors.

In order to test the model training, a term that has a specific technical meaning in the constitutional rights adjudication of the GFCC, namely Willkürverbot (interdiction of arbitrariness),Footnote 80 is used to see if the most similar words provided by the newly trained model make sense. As the most similar vector, the model first outputs the same word in a different grammatical case (Willkürverbots).Footnote 81 Next come the terms Verhältnismäßigkeitserfordernissen (proportionality requirements) and Gleichheitssatz (principle of equality). The language model thereby shows that it grasps genuinely doctrinal relationships from the case law, because the court examines the general principle of equality in Art. 3(1) of the Basic Law with both the proportionality test and the arbitrariness test.Footnote 82 The next most similar words, Willkür (arbitrariness) and Verstoß (infringement), are also plausible in context and suggest that the training was successful. Based on the new GFCC word embedding model, the next step is to introduce the methodology for implementing the scaling.
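The face validity check described above amounts to ranking all other words by their cosine similarity to a key term. The following sketch shows the mechanics on invented two-dimensional toy vectors; the numbers are assumptions chosen for illustration and do not come from the trained GFCC model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(term, embeddings, topn=3):
    """Return the topn other words ranked by cosine similarity to term."""
    target = embeddings[term]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors with invented numbers.
toy_embeddings = {
    "Willkürverbot": [1.0, 0.2],
    "Willkürverbots": [0.98, 0.22],
    "Gleichheitssatz": [0.8, 0.5],
    "Vertrag": [0.0, 1.0],
}

neighbors = most_similar("Willkürverbot", toy_embeddings, topn=2)
for word, score in neighbors:
    print(word, round(score, 3))
```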

III. Measures

For the implementation, two seed word lists and two methods for scaling text passages are proposed in the following, resulting in four different measurements. Only after the subsequent validation in Section D.I. will it be decided which measurement performs best and will be used to test the hypotheses. Two independent seed word lists were created using two approaches. The first is a minimalist list consisting of only four seed words: Intensiv (intensive), desto (the more), Verhältnismäßigkeitsprinzip (principle of proportionality), and Abwägung (balancing). These four words were taken from the example sentences used to introduce the concept of balancing language in Section B.III. This approach thus limits the task of selecting seed words to a minimum and relies heavily on the generalization through the embedding model.

For the second list, an extended list of seed words was created. Because the selection requires excellent domain knowledge, the seed words were chosen in discussions with two professors of German public law. Both can be considered experts in GFCC jurisprudence, as both have been involved in proceedings before the court. In order to select the words, the professors were first asked to compile as many terms as possible that they felt represented balancing in the jurisprudence. This list was supplemented by word suggestions from Bomhoff.Footnote 83 With this initial list of words, random sentences were drawn from the case law and then examined to see if the court had actually balanced. False terms were removed. The final extended seed word list contains fifteen terms.Footnote 84 The seed word lists are each used to create balancing vectors, which abstractly represent balancing language.

Two methods are used to scale the decision texts. In both cases, the scaling is done at the paragraph level. Paragraphs represent closed argument units structured by the court. The first method uses the word embedding model to obtain an extended and improved dictionary.Footnote 85 This is done by taking the sixty most similar terms for the newly created balancing vectors. The Appendix contains the two dictionaries, which also give an initial face validity that the procedure can indeed capture balancing language. The next step is to count the number of occurrences of the words in the dictionary for each paragraph. Thus, the method used is a dictionary approach, whereby the word embedding model improves the word selection, that is, reduces the selection bias.
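A minimal sketch of this dictionary approach follows. It assumes a balancing vector is already available and uses invented two-dimensional toy vectors; the dictionary size is set to two here purely for readability, whereas the text uses the sixty most similar terms. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_dictionary(balancing_vector, embeddings, size):
    """Select the words whose vectors are most similar to the balancing vector."""
    ranked = sorted(
        embeddings,
        key=lambda word: cosine_similarity(embeddings[word], balancing_vector),
        reverse=True,
    )
    return set(ranked[:size])


def dictionary_score(tokens, dictionary):
    """Count how many tokens of a paragraph occur in the dictionary."""
    return sum(1 for token in tokens if token in dictionary)


# Toy vectors with invented numbers.
toy_embeddings = {
    "Abwägung": [0.9, 0.1],
    "desto": [0.8, 0.3],
    "Gewicht": [0.7, 0.6],
    "Vertrag": [0.1, 0.9],
    "Frist": [0.0, 1.0],
}
balancing_vector = [1.0, 0.0]

dictionary = build_dictionary(balancing_vector, toy_embeddings, size=2)
paragraph = ["Abwägung", "ergibt", "desto", "Gewicht", "Abwägung"]
score = dictionary_score(paragraph, dictionary)
print(sorted(dictionary), score)
```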

The second method uses word embeddings to also represent the paragraphs as vectors.Footnote 86 A vector is computed for each paragraph. Each paragraph vector is a mean of the vectors of the occurring words multiplied by a weighting factor for each word.Footnote 87 The weights are inversely proportional to the frequency of each word in the corpus, so that rare words have more impact on the averaged vector. The result is a vector representation for each paragraph. Finally, for each of these paragraph vectors, the cosine similarity to the two balancing vectors, one vector for each seed word list, is calculated. This method can be called a paragraph embedding approach.
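The paragraph embedding approach can be sketched as follows. The exact weighting scheme is given in the study's Footnote 87, which is not reproduced here, so this sketch simply assumes a weight of one divided by the corpus count of each word. The vectors and counts are invented toy values, and tokens without an embedding are skipped.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def paragraph_vector(tokens, embeddings, corpus_counts):
    """Mean of word vectors, each scaled by an inverse-frequency weight."""
    known = [t for t in tokens if t in embeddings]
    if not known:
        raise ValueError("paragraph contains no token with an embedding")
    dims = len(embeddings[known[0]])
    total = [0.0] * dims
    for token in known:
        weight = 1.0 / corpus_counts[token]  # assumed weighting scheme
        for i, component in enumerate(embeddings[token]):
            total[i] += weight * component
    return [component / len(known) for component in total]


# Toy data with invented numbers: a rare balancing term and a frequent term.
toy_embeddings = {"Abwägung": [1.0, 0.0], "Gesetz": [0.0, 1.0]}
toy_counts = {"Abwägung": 10, "Gesetz": 1000}
balancing_vector = [1.0, 0.0]

vector = paragraph_vector(["Abwägung", "Gesetz", "unbekannt"], toy_embeddings, toy_counts)
score = cosine_similarity(vector, balancing_vector)
print(round(score, 4))
```

With equal weights the same paragraph would sit halfway between the two word vectors. The inverse-frequency weights pull it toward the rarer term, which is the effect the text describes.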

Thus, four different measures of balancing language are made for all paragraphs: (1) Dictionary approach on original seed words, (2) dictionary approach on extended seed words, (3) paragraph embeddings on original seed words, and (4) paragraph embeddings on extended seed words. Finally, all measures were normalized by subtracting the mean and dividing by the standard deviation.
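The normalization step is a standard z-score transformation. The sketch below uses the population standard deviation, which is an assumption since the text does not say whether the population or the sample version was used, and the input values are invented.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(value - mean) / sd for value in values]


raw_scores = [2, 4, 4, 4, 5, 5, 7, 9]  # invented raw measure values
z_scores = standardize(raw_scores)
print(z_scores)
```

After this transformation each measure has mean zero and unit standard deviation, which puts the dictionary counts and the cosine similarities on a comparable scale.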

All four measures are supposed to measure balancing language in the GFCC case law. The scaling is intended to represent the intensity of balancing language in the given text passage. The following section examines whether the measure is sufficiently robust and which of the four measurement methods provides the best results.

D. Validation

Validation of the measures is crucial, as the performance of text-based methods is highly dependent on the area of use.Footnote 88 No in-depth experience has been gained with texts from German court decisions.Footnote 89 Therefore, it is particularly important to ensure that the measures actually measure what they are supposed to measure. In the following, the validation will be performed on three datasets in order to demonstrate the suitability of the methods used. The basis for the first validation is a newly created dataset in which three human annotators coded balancing in 1,500 paragraphs, as discussed in Section D.I. This dataset has been created for this Article and will serve as a basis for deciding which of the measures introduced above works best. Subsequently, two additional external data sources are used to ensure the validity of the measurement. On the one hand, data from Petersen,Footnote 90 who hand-coded the argument type of 240 decisions, are used, as discussed in Section D.II. On the other hand, data from Lüders et al.,Footnote 91 who annotated the proportionality analysis at sentence level for 300 decisions of the GFCC, are used, as covered in Section D.III.

I. Balancing Language

For the first validation, new annotation data were created. Three students each hand-coded 1,500 paragraphs. All three were studying law at German universities and were at the end of their studies for the first state examination. The 1,500 paragraphs were randomly selected from the case law and each coder was asked to say for each paragraph whether balancing was applied or not. A short codebook with guidelines and examples was provided for this purpose.

There was agreement between the three coders on this task in seventy-eight percent of cases. In 1,142 cases there was agreement that there was no balancing. However, it is striking that one annotator recognized balancing in 326 cases, while the other two only recognized it in seventy-seven and seventy-three cases respectively. In only thirty-two cases did all three agree that there was balancing. In a further fifty-four cases exactly two coders annotated balancing, whereas in 272 cases only one coder did so. Thus, agreement between annotators is not particularly strong.
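The reported counts can be cross-checked with simple arithmetic. The sketch below reads the "further fifty-four cases" as paragraphs in which exactly two coders marked balancing. It also computes Fleiss' kappa from the same counts; this statistic is not reported in the Article and is derived here only as an illustration of chance-corrected agreement.

```python
# Counts reported in the text (1,500 paragraphs, three coders each).
TOTAL = 1500
N_CODERS = 3
# Number of paragraphs by how many coders marked "balancing".
items_by_positive_votes = {0: 1142, 1: 272, 2: 54, 3: 32}
positive_codes_per_coder = [326, 77, 73]

# Consistency checks: the four groups should cover all paragraphs, and the
# number of positive codes should match whether counted per item or per coder.
n_items = sum(items_by_positive_votes.values())
positive_codes = sum(votes * n for votes, n in items_by_positive_votes.items())

# Share of paragraphs on which all three coders gave the same answer.
unanimous_share = (items_by_positive_votes[0] + items_by_positive_votes[3]) / TOTAL

# Fleiss' kappa for two categories (balancing / no balancing).
pairs = N_CODERS * (N_CODERS - 1)
mean_item_agreement = sum(
    n * (v * (v - 1) + (N_CODERS - v) * (N_CODERS - v - 1)) / pairs
    for v, n in items_by_positive_votes.items()
) / TOTAL
p_pos = positive_codes / (TOTAL * N_CODERS)
expected_agreement = p_pos ** 2 + (1 - p_pos) ** 2
fleiss_kappa = (mean_item_agreement - expected_agreement) / (1 - expected_agreement)
```

Under these counts the chance-corrected agreement is well below the raw seventy-eight percent, because "no balancing" is by far the most frequent code and much of the raw agreement would be expected by chance.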

There may be several reasons why the outcome of the task is so inconsistent. Of course, human annotations are always influenced by errors to some degree. However, the domain of the task certainly plays a particular role. The annotators stated afterwards that they found it difficult to read only isolated paragraphs and would have liked to know more context for the decision. It should also be remembered that it is not part of legal training to analyze in depth the type of arguments deployed by constitutional courts and that other studies using lawyers for coding also find high variance in their results.Footnote 92

Nevertheless, these data can be used to validate the measure.Footnote 93 From a methodological point of view, it is crucial that they are not understood as a gold standard,Footnote 94 but as an expression of a legal routine, for which it is obviously not clear in many cases whether balancing takes place or not. Against this background, an additive index is created from the three annotations, ranging from zero to three. The zero here expresses that no balancing language occurs, while three represents the cases in which balancing clearly occurs. The two values in between represent intermediate levels that display the intensity of the balancing language. Human disagreement in coding is, thus, understood as reflecting the strength of the linguistic expression of balancing language in the text.Footnote 95 Accordingly, it is crucial for the validation of the automated measurement that it captures this scale.

Figure 2 shows the average of the four measures of balancing language—including bootstrapped ninety-five percent Confidence Intervals—against the additive index of human annotation. All four measures are generally correct in their tendency to represent the additive index. The measures based on paragraph embeddings slightly outperform the measures based on dictionary approaches.Footnote 96 This is because the latter struggle with the first stage of the index and have much larger confidence intervals. Among the seed word lists, the extended list performs better, although it is quite surprising how well the original list performs, given that it consists of only four chosen seed words. Overall, the correlations are not particularly strong, which means that there is a lot of variance. However, the correlation of the best measure, paragraph embeddings on extended seed words, is close to r=0.5, which is sufficient given the difficulties the human annotators had and the typical scores for such validations otherwise.Footnote 97

Figure 2. Human annotation of balancing language (x-axis) against automated measures (mean with bootstrapped ninety-five percent Confidence Intervals). The measures of balancing language (y-axis) are each normalized (Mean = 0; SD = 1).
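The comparison summarized in Figure 2 can be sketched as follows: the three binary annotations are summed into the additive index, and the automated score is averaged per index level with a percentile bootstrap interval. All data values here are hypothetical, and the percentile method with 2,000 resamples is an assumption; the Article does not specify its bootstrap procedure.

```python
import random
import statistics


def additive_index(codes):
    """Sum of the three binary annotations (0 = none, 3 = unanimous)."""
    return sum(codes)


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical paragraphs: (three coder annotations, normalized automated score).
paragraphs = [
    ((0, 0, 0), -0.6), ((0, 0, 0), -0.2), ((0, 0, 0), 0.1), ((0, 0, 0), -0.9),
    ((1, 0, 0), 0.2), ((0, 1, 0), 0.5), ((1, 0, 0), -0.1),
    ((1, 1, 0), 0.9), ((1, 0, 1), 1.3), ((0, 1, 1), 0.7),
    ((1, 1, 1), 1.8), ((1, 1, 1), 2.4), ((1, 1, 1), 1.5),
]

# Group the automated scores by the level of the additive index.
by_level = {}
for codes, score in paragraphs:
    by_level.setdefault(additive_index(codes), []).append(score)

# Mean and bootstrapped interval per index level.
summary = {
    level: (statistics.fmean(scores), bootstrap_ci(scores))
    for level, scores in sorted(by_level.items())
}
```

A measure that tracks the human annotations should show means that rise with the index level, as the toy data are constructed to do.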

The first step has thus shown that the measurement of balancing language can map human intuitions, although these turn out to be not very consistent. The measure based on the paragraph embeddings on the extended seed word list turns out to be the most appropriate, so only this measure will be used in the remainder of the Article.

II. Decisive Arguments

The second validation uses data from Petersen.Footnote 98 Petersen conducted an argumentation analysis at the decision level, for which a system of ten argumentation categories was developed. The aim was to capture which argumentation techniques were crucial to the GFCC’s decision, with the possibility of more than one category being assigned to a decision. One category in the annotation scheme is balancing. Thus, for validation purposes, it is expected that more balancing language will be used in decisions where balancing has been annotated by Petersen.

A linear ordinary least squares regression was computed for validation.Footnote 99 The dependent variable is the balancing language—paragraph embeddings on extended seed words—with averages calculated for each decision. The independent variables are the ten dichotomous argumentation categories. Figure 3 visualizes the estimates. The balancing argumentation category has a strongly positive estimate. In contrast, as expected, there are no significant or strongly positive estimates for the categorical arguments, as well as for consistency, coherence, and deductive reasoning. The estimate for the rational connection and less restrictive means test category, the techniques corresponding to the second and third steps of the proportionality test, is surprisingly strong. However, Petersen also claims to have found a lot of implicit balancing within the less-restrictive-means test.Footnote 100

Figure 3. Regression estimates with ninety-five percent Confidence Intervals for a linear OLS model. Dependent variable: Balancing language (paragraph embeddings on extended seed words). Only predictors whose error bars do not cross the vertical zero line are significant.

III. Proportionality

For the third validation, a dataset is used in which proportionality tests have been annotated in 300 decisions of the GFCC.Footnote 101 Each sentence of the decisions was annotated based on whether it could be assigned to one of the steps of the proportionality test. In addition, there is a category for passages that obviously belong to a proportionality test but cannot be assigned to any step or can be assigned to several steps. In forty-nine of the 300 decisions, at least one of the proportionality test steps was annotated. To validate the measurement of balancing language, these forty-nine decisions are checked to see which passages use more balancing language. If the measure works as intended, the balancing step should show the strongest balancing language.

Figure 4 shows the mean scores for balancing language, with ninety-five percent bootstrapped Confidence Intervals, for each of the steps. The highest mean score is for the category representing text passages that are part of a proportionality test but were either not assigned to any step or were assigned to multiple steps. The high score is not surprising, as these passages are short, mostly found as introductions to tests, and use several key terms. Among the steps of the proportionality test, balancing clearly has the highest mean score. Interestingly, a comparatively high mean score was also found for the necessity step, the part of the test in which the less-restrictive-means test is performed. This again confirms Petersen’s observation on implicit balancing in the less-restrictive-means test.Footnote 102

Figure 4. Mean balancing language—paragraph embeddings on extended seed words—for the proportionality test steps—mean with bootstrapped ninety-five percent Confidence Intervals.

Thus, the performance of the balancing language measure was validated with three data sources. All three validation strategies showed that the scaling of balancing language behaved as theoretically expected. Although the identification of balancing language proved difficult even for human annotators, the validation performed here is sufficient reason to assume that the measure captures the relevant trends in GFCC jurisprudence.

E. Results: The Usage of Balancing Language

Having obtained a robust measure of balancing language for the GFCC case law, the Article now turns to the hypotheses introduced earlier. Section E.I. first provides an overview of the variables and the analysis, followed by a discussion of the results in Section E.II.

I. Model

The variables used are summarized in Table 1. Balancing language is the dependent variable. The measurement uses the paragraph embedding approach on the extended seed word list. Each paragraph of the reasoning section was scaled individually and then averaged for each decision (n=3,274). As before, the measured values were normalized.Footnote 103 The independent variables relevant to the hypothesis are, on the one hand, the year of the decision and, on the other hand, the dichotomous variable of whether a law is overturned in the decision.

Table 1. Descriptive statistics of the data used.

The first question to be addressed is whether more balancing language is used in decisions overturning laws. A naive approach would be to compare the average use of balancing language between the group of decisions overturning laws and the group of decisions not overturning laws. On average, more balancing language is, indeed, used in decisions overturning laws than in decisions not overturning laws. This comparison of the averages between the two groups of decisions is in line with hypothesis H1. However, the explanatory power of such a naive comparison is very limited: There are likely confounding variables.

Therefore, this Article uses regression analysis to better understand the relationships between variables.Footnote 104 The regression model includes control variables in order to isolate the theoretically relevant relationship from potential confounders. Three control variables are used for the following analysis. The first is a dichotomous variable for the senate. The GFCC operates in two independent senates, each with different responsibilities. Accordingly, it is plausible that they behave in different ways.Footnote 105 Second, the length of the reasoning section is controlled for, as the decisions of the GFCC vary greatly in length. For this purpose, the number of tokens in the reasoning section was counted. As the distribution is highly skewed, the measure was log-transformed.Footnote 106 Third, a dichotomous control variable is introduced for the different types of proceedings. The GFCC has a number of different proceedings that lead to different types of decisions. The variable used here distinguishes two groups. The first comprises right-based review proceedings, namely constitutional complaints as well as abstract and specific judicial review of statutes. In these proceedings, violations of fundamental rights can be examined by the court. The second group includes all other types of proceedings, such as disputes between the highest federal organs (Organstreit proceedings), disputes between the Federation and the Länder, proceedings to ban political parties, and electoral complaints.

The analysis was carried out using linear ordinary least squares regression calculated with the statistical programming language R. As the time effect is not necessarily linear, squared and cubic terms for the year variable are included in the regression model. Table 2 shows the estimated model parameters of the preferred specification.Footnote 107

Table 2. Coefficients of OLS linear regression model.
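The structure of such a model, and the reason a naive group comparison can mislead, can be sketched with noise-free toy data. The Article's analysis was run in R; the Python below is only an illustration. The coefficients, the year scaling, and the data are hypothetical and are not taken from Table 2, and the senate and proceeding-type controls are omitted for brevity.

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def design_row(year, overturned, log_length):
    """Intercept, linear/squared/cubic year terms, overturn dummy, log length."""
    t = (year - 1985) / 35  # centered, rescaled year (assumed scaling)
    return [1.0, t, t ** 2, t ** 3, float(overturned), log_length]


# Hypothetical "true" coefficients used only to generate the toy data.
TRUE_BETA = [0.1, 0.5, -0.6, -0.3, -0.2, 0.4]

X, y = [], []
for year in range(1951, 2021):
    # Overturning decisions are made longer on purpose to create confounding.
    for overturned, log_length in [(0, 6.0), (0, 7.0), (1, 8.0), (1, 9.0)]:
        row = design_row(year, overturned, log_length)
        X.append(row)
        y.append(sum(b * x for b, x in zip(TRUE_BETA, row)))

beta_hat = ols(X, y)


def group_mean(flag):
    """Mean outcome for decisions with the given overturn flag."""
    values = [yi for row, yi in zip(X, y) if row[4] == flag]
    return sum(values) / len(values)


naive_difference = group_mean(1.0) - group_mean(0.0)  # ignores length
controlled_effect = beta_hat[4]  # overturn coefficient, holding length fixed
```

In this constructed example the naive difference between the two groups is positive, while the coefficient on the overturn dummy is negative once length is held constant. The sign flip arises only because the toy overturning decisions were made longer by design; it illustrates the mechanism, not the Article's estimates.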

The first three coefficients estimate the time effect. The quadratic and cubic year terms suggest that there is no mere linear relationship between the use of balancing language and the year of decision. The estimated temporal dynamics are best understood in visualized form. Figure 5 shows the values of balancing language predicted by the regression model over time for decisions overturning laws as well as for the reference group. The use of balancing language increases until around the year 2000. From this point on, a decreasing use of balancing language is modeled. Thus, a temporal effect on balancing language was indeed found, but it only behaves as expected for a certain period of time. The development after the year 2000, however, contradicts the expectations of the hypothesis. In this sense, hypothesis H2 must be qualified.

Figure 5. Predicted values of balancing language over time for decisions overturning laws versus decisions not overturning laws—including ninety-five percent Confidence Intervals, adjusted for a right-based review proceeding of the first senate with median length.

Also surprising is the coefficient for the variable that indicates whether or not a law was overturned in the decision. Hypothesis H1 states that decisions overturning laws use more balancing language, which the naive empirical comparison of averages seems to confirm. However, when regression analysis is used to control for other variables, the positive effect disappears.Footnote 108 In fact, in the preferred model specification presented above, there is even an effect in the opposite direction. Again, the reference category here is the class of decisions that do not overturn a law. The negative coefficient of the model, therefore, means that in decisions in which a law is overturned, less balancing language is used than in the reference class. This result, thus, contradicts the expectations of hypothesis H1.

The control variables also have plausible coefficients. They show that the longer the reasoning section, the more balancing language is used. According to the model, the two senates also differ in their use of balancing language. The first senate forms the reference category, so the negative effect indicates that the second senate uses less balancing language than the first. It can also be seen that significantly more balancing language is used in right-based review procedures.

II. Discussion

The empirical results thus shed mixed light on the hypotheses initially proposed. The expectations regarding the increase in the use of balancing language over time can only be partially confirmed. In its first years, the GFCC was more reticent and balancing in constitutional adjudication was not yet an established argumentation technique. There followed decades of increasing use of balancing language as the court became a stronger institution.Footnote 109 However, this trend appears to have changed after 2000, as the court seems to have become more reluctant. Such a trend reversal was not expected in the literature and is therefore an interesting empirical result. However, it is consistent with the findings of Petersen, who observed a decline in the 2010s.Footnote 110 Stohlmann and his colleagues also report an increase in the use of the proportionality test in the early decades, but conclude that it is by no means omnipresent.Footnote 111 Even in recent years, there is a large proportion of decisions in which proportionality is not invoked.Footnote 112

The model shows plausible coefficients for the control variables. That more balancing language tends to be used in longer reasoning sections is obvious, because balancing in constitutional rights adjudication requires a great deal of argumentative space. It is also apparent that more balancing language tends to be used in right-based review proceedings. The balancing debate is, after all, about balancing fundamental rights.Footnote 113 The model also shows an effect for the senates. Both senates work independently and the results seem to indicate that there are differences in the usage of balancing language, with the first senate using more balancing language than the second senate.

The hypothesis that was ultimately the starting point of the Article could not be confirmed. It was assumed that there would be more balancing in decisions overturning laws. However, when other variables were controlled for in the model, the expected positive effect disappeared. In the preferred specification the model even shows the opposite effect. Decisions that overturn laws tend to use less balancing language than other decisions. This insight was made possible by a new data set and a new methodological approach. However, the findings are consistent with those of the other empirical studies. Petersen, who looked only at decisions overturning laws, concluded that balancing is less decisive in the GFCC jurisprudence than critics suggest.Footnote 114 Based on detailed argumentation analysis, Petersen states: “The preceding analysis has shown that, in most cases, the German Federal Constitutional Court does not use balancing to correct fundamental value decisions of the legislature.”Footnote 115

As a result, the findings of several empirical studies question the factual basis of the vehement criticism of balancing by the GFCC. Based on empirical insights, they arrive at a different description of the connection between this particular argumentation technique and judicial activism. Accordingly, a theoretical contextualization is necessary.

In order to contextualize the empirical findings with regard to balancing, first of all, it must be emphasized that the critics’ initial thesis is very broad. They see balancing as a method that is employed across the entire case law, that serves a strategic purpose, and that is increasingly being used. By contrast, a substantial body of German public law research argues that the GFCC uses balancing in a nuanced way. The literature emphasizes the sophisticated application requirements and mechanisms of proportionality and balancing.Footnote 116 This calls for a much more differentiated view of the court’s work. Therefore, the empirical findings presented in this study may not come as a surprise. But that requires critics to come up with better and more specific explanations and hypotheses. The general suspicion cannot persist.

This raises the question of what can be inferred about the strategic use of balancing. It seems that the GFCC is more defensive in the majority of potentially confrontational decisions than assumed in the literature.Footnote 117 The recognition of the contentious nature of balancing is a good reason for the court to refrain from balancing. Since the first studies of judicial behavior, it has been argued that constitutional courts undermine their own position if they are perceived as activist actors.Footnote 118 Not using controversial techniques may, therefore, prove to be a sensible strategy.

However, it would be misleading to conclude from the empirical results that the era of balancing and proportionality is over. The so-called PSPP Decision of the GFCC on the Public Sector Asset Purchase Program of the European Central Bank, which falls within the last year covered by this study, is an extremely prominent and controversial counterexample.Footnote 119 In this decision, the GFCC declares a decision of the Court of Justice of the European Union (“CJEU”) to be ultra vires, largely on the grounds that, according to the GFCC, the CJEU did not adequately apply the proportionality test.Footnote 120 On the one hand, the decision proves the role that proportionality can play and the extent to which it is a distinctly German legal heritage. On the other hand, it is also proof of the diverse ways in which this concept can be applied. The use of proportionality in this decision is an atypical exception.Footnote 121 The case concerns the non-application of proportionality by another court, not the justification of a judgment on the basis of balancing. Accordingly, the methodological approach chosen here is certainly not the most suitable for this special case, which provoked whole debates.Footnote 122

Beyond balancing, the relevance of the present study for the relationship between legal argumentation and judicial activism is worth considering. The study focused on a very specific mechanism. It assumed that a particular argumentative technique, the use of balancing, was associated with a particular form of judicial activism, the overturning of laws. The presumed correlation could not be proven empirically. This is surprising, given the international academic attention.

Of course, this does not mean that there is no connection between the court’s legal argumentation method and judicial activism. There are two aspects to consider. First, judicial activism goes beyond simply overturning laws. The PSPP decision is a good example of this. It is obvious that this decision must be considered judicial activism; at the same time, this decision does not overturn any law. It is therefore worth considering the development of a different criterion of judicial activism for empirical research.

Second, there are other argumentation techniques and other forms of judicial discretion. Balancing is, of course, only one specific form of constitutional argumentation. There are many forms of judicial discretion used by the GFCC. One example is the interpretation of the eternity clause of Art. 79(3) of the Basic Law, which prohibits the amendment of certain parts of the constitution, although the GFCC understands it as a limitation to European integration.Footnote 123 There are also a number of rulings on distributive issues that derive a substantive minimum from human dignity, Art. 1(1) of the Basic Law, in conjunction with the principle of the social welfare state, Art. 20(1) of the Basic Law, which oblige the legislature to maintain certain standards.Footnote 124 Thus, there are several examples of what can be understood as judicial discretion—apart from balancing—that have provoked strong criticism.Footnote 125 However, in order to conduct empirical research into the toolbox of legal activism or judicial discretion, theories are needed that provide an explanation for the systematic use of certain argumentative tools.

Florian Meinel has recently made such a contribution, which is currently the subject of much discussion. Meinel argues that over the last 15 years the GFCC has undertaken a far-reaching reinterpretation of the German constitutional order.Footnote 126 Proportionality is seen as an important methodological tool.Footnote 127 However, the main focus is on the way in which the court interprets the law on the organization of the state. Meinel identifies a systematic technique used by the court—the application of administrative law standards to parliament and government. According to Meinel, “[t]he Court’s constitutional language has systematically levelled any difference between political institutions and administrative authorities.”Footnote 128 Meinel’s analysis is instructive in two ways. On the one hand, it fits with the findings presented here. Meinel’s diagnosis is consistent with the decline of balancing language in recent years. The GFCC focuses less on fundamental rights and more on the interpretation of competences in the law of state organization, where balancing is less common. On the other hand, Meinel formulates an alternative theory of the argumentative means used by the Court. Meinel sees the application of administrative law standards to parliamentary institutions as a systematic, strategic approach by the GFCC.

Overall, the relationship between judicial argumentation and judicial activism remains an important question for empirical research. The aim of this study was to provide quantitative insights into the use of a particular argumentative technique. Therefore, a major focus of the Article was to demonstrate that quantitative textual analysis methods can be reliably used for this task. This study is thus one of the first empirical contributions to provide insights into the strategic use of argumentation techniques for such a large number of cases.

At the same time, of course, the empirical evidence presented here comes with limitations. It is obvious that judicial reasoning is more complex and multifaceted than can be captured by measuring balancing language. In this sense, more effort is needed in the future to do empirical justice to the phenomenon of legal reasoning.Footnote 129 The analysis strategy used also has its limitations. This Article uses a simple linear regression model with only a few control variables. This Article itself demonstrates the importance of the choice of control variables. Accordingly, it is essential for further empirical analysis of balancing, but also for legal reasoning in general, to gain a deeper understanding of how activist decisions differ from others. This requires, in particular, conceptual work for better empirical measures. There are also a growing number of attempts to use specific research strategies and methods to carry out causal analyses.Footnote 130 Such analyses promise more rigorous results, but they are also much more demanding.Footnote 131 In particular, they require a causal theory of how courts decide and argue. The present study is more modest in this respect. It does not claim causality, but merely points out that decisions overturning laws do not simply use more balancing and the use of balancing has not simply continued to increase. In any case, this empirical work calls for new empirical and theoretical work on the Court’s strategic use of judicial argumentation techniques. In many ways, therefore, this Article demonstrates that the use of textual methods for the study of judicial decisions still has much potential.

F. Conclusion

This Article used word embeddings, a state-of-the-art approach to text analysis, to empirically investigate a specific argumentation technique, namely balancing, in the GFCC jurisprudence. This demonstrates the potential for judicial behavior research to use text-as-data methods to illuminate judicial argumentation. Contrary to theoretical expectations, the GFCC does not simply use more balancing language in cases where laws are overturned. It has also been shown that the use of balancing language only increased in the first fifty years of its existence until around 2000, as had been assumed. Since then, a decline has been observed. Thus, it appears that empirical research on the behavior of courts can fruitfully contribute to the theoretical debate on constitutional courts and their reasoning. For the debate on balancing in GFCC jurisprudence, it is no longer possible to make the sweeping claim that the court is more likely to engage in balancing when confronting the legislature.

Acknowledgments

In particular, I would like to thank Alexander Tischbirek for always being sure that there is such a thing as balancing language, and Christoph Möllers for always being sure that I could measure it. I would also like to thank the three annotators, and my colleagues Benjamin Engst, Anselm Hager, Christian Rauh, Bent Stohlmann, Nils Weinberg, Luisa Zimmer, and Lisa Zehnter for their helpful feedback.

Competing Interests

The author has no competing interests to declare that are relevant to the content of this Article.

Funding Statement

The work was funded by the DFG Leibniz Prize for Prof. Dr. Christoph Möllers, LL.M., which was awarded by the public German Research Foundation (Deutsche Forschungsgemeinschaft).

References

1 See, e.g., Dieter Grimm, Proportionality in Canadian and German Constitutional Jurisprudence, 57 U. Toronto L.J. 383, 385 (2007) (explaining the concept of balancing as used in German constitutional courts); Aharon Barak, Proportionality: Constitutional Rights and Their Limitations 180 (2012).

2 See, e.g., Talya Steiner, Andrej Lang & Mordechai Kremnitzer, Introduction: Analysing Proportionality Comparatively and Empirically, in Proportionality in Action 1 (Mordechai Kremnitzer, Talya Steiner & Andrej Lang eds., 2020) (noting the German origin of the balancing test); Alec Stone Sweet & Jud Mathews, Proportionality Balancing and Constitutional Governance: A Comparative and Global Approach 59–61 (2019) (the same).

3 See, e.g., Kai Möller, The Global Model of Constitutional Rights 13 (2015); Stone Sweet & Mathews, supra note 2, at 1.

4 See, e.g., Bernhard Schlink, Proportionality, in The Oxford Handbook of Comparative Constitutional Law 718, 735 (Michel Rosenfeld & András Sajó eds., 2012).

5 A major exception is the work of Niels Petersen, Proportionality and Judicial Activism: Fundamental Rights Adjudication in Canada, Germany and South Africa (2017), discussed in detail in Section B.II. of this Article.

6 See Jacco Bomhoff, Balancing Constitutional Rights 16–18 (2015).

7 See generally Jeffrey Segal, Separation-of-Powers Games in the Positive Theory of Congress and Courts, 91 Am. Pol. Sci. Rev. 28 (1997); Lee Epstein & Jack Knight, The Choices Justices Make (1998); Lee Epstein & Jack Knight, Toward a Strategic Revolution in Judicial Politics: A Look Back, A Look Ahead, 53 Pol. Rsch. Q. 625 (2000); Clifford J. Carrubba, Matthew Gabel & Charles Hankla, Judicial Behavior Under Political Constraints: Evidence from the European Court of Justice, 102 Am. Pol. Sci. Rev. 435 (2008).

8 See generally Georg Vanberg, Legislative-Judicial Relations: A Game-Theoretic Approach to Constitutional Review, 45 Am. J. Pol. Sci. 346 (2001); Georg Vanberg, The Politics of Constitutional Review in Germany (2005); Jeffrey K. Staton & Georg Vanberg, The Value of Vagueness: Delegation, Defiance, and Judicial Opinions, 52 Am. J. Pol. Sci. 504 (2008); Jay Krehbiel, The Politics of Judicial Procedures: The Role of Public Oral Hearings in the German Constitutional Court, 60 Am. J. Pol. Sci. 990 (2016); Benjamin Engst, The Two Faces of Judicial Power: Dynamics of Judicial-Political Bargaining (2021).

9 See generally Lee Epstein & Keren Weinshall, The Strategic Analysis of Judicial Behavior: A Comparative Perspective (2021); Arthur Dyevre, Unifying the Field of Comparative Judicial Politics: Towards a General Theory of Judicial Behaviour, 2 Eur. Pol. Sci. Rev. 297 (2010).

10 Tom S. Clark, Measuring Law, in Routledge Handbook of Judicial Behavior 84, 93 (Robert M. Howard & Kirk A. Randazzo eds., 2018); Jeffrey Lax, The New Judicial Politics of Legal Doctrine, 14 Ann. Rev. Pol. Sci. 131, 132 (2011).

11 Christoph Möllers, Legality, Legitimacy, and Legitimation of the Federal Constitutional Court, in The German Federal Constitutional Court 131, 145–47 (Matthias Jestaedt, Oliver Lepsius, Christoph Möllers & Christoph Schönberger eds., 2020); Epstein & Weinshall, supra note 9, at 35–36.

12 Duncan Kennedy, A Transnational Genealogy of Proportionality in Private Law, in The Foundations of European Private Law 185 (Roger Brownsword, Hans-Wolfgang Micklitz, Leone Niglia & Stephen Weatherill eds., 2011); Bomhoff, supra note 6, at 1–2.

13 See generally T. Alexander Aleinikoff, Constitutional Law in the Age of Balancing, 96 Yale L.J. 943 (1987); Vicki Jackson, Constitutional Law in an Age of Proportionality, 124 Yale L.J. 3094 (2015).

14 Barak, supra note 1, at 180–81; Stone Sweet & Mathews, supra note 2, at 60–69.

15 Werner Heun, The Constitution of Germany: A Contextual Analysis 196–97 (2011).

16 Möller, supra note 3, at 13; Stone Sweet & Mathews, supra note 2, at 67–68.

17 Grimm, supra note 1, at 385.

18 Barak, supra note 1, at 19–20; Anne Peters, A Plea for Proportionality: A Reply to Yun-Chien Chang and Xin Dai, 19 Int’l J. Const. L. 1135, 1143–45 (2021). See generally Robert Alexy, The Absolute and the Relative Dimension of Constitutional Rights, 37 Oxford J. Legal Stud. 31 (2017); Mattias Kumm, The Idea of Socratic Contestation and the Right to Justification: The Point of Rights-Based Proportionality Review, 4 Law & Ethics Hum. Rts. 142 (2010).

19 Grégoire C.N. Webber, The Negotiable Constitution on the Limitation of Rights 87–89, 100, 113–15 (2012); Aleinikoff, supra note 13, at 972–95.

20 Bernhard Schlink, Der Grundsatz Der Verhältnismäßigkeit, in Festschrift 50 Jahre BVerfG Band II 445 (Peter Badura & Horst Dreier eds., 2001); Schlink, supra note 4.

21 Webber, supra note 19, at 97.

22 Schlink, supra note 4, at 735; Aleinikoff, supra note 13, at 984.

23 Oliver Lepsius, The Standard-Setting Power, in The German Federal Constitutional Court, supra note 11, at 70, 99; Petersen, supra note 5, at 58.

24 Michaela Hailbronner & Stefan Martini, The German Federal Constitutional Court, in Comparative Constitutional Reasoning 356, 387 (Andras Jakab, Arthur Dyevre & Giulio Itzcovich eds., 2017).

25 Heun, supra note 15, at 43; Ralf Poscher, Das Grundgesetz Als Verfassung Des Verhältnismäßigen Ausgleichs, in Handbuch des Verfassungsrechts: Darstellung in transnationaler Perspektive 149, 159 (Ralf Poscher, Klaus Ferdinand Gärditz, Matthias Herdegen & Johannes Masing eds., 2021).

26 Matthias Jestaedt, Verhältnismäßigkeit als Verhaltensmaß, in Verhältnismäßigkeit 293, 293 (Matthias Jestaedt & Oliver Lepsius eds., 2021).

27 Guy Lurie, Proportionality and the Right to Equality, 21 German L.J. 174, 191 (2020).

28 Petersen, supra note 5, at 60, 67–68.

29 Bomhoff, supra note 6, at 19.

30 For the operationalization, see Section B.III. of this Article.

31 Vanberg, supra note 8, at 348, 355; Vanberg, supra note 8, at 24, 102; Krehbiel, supra note 8, at 992, 997.

32 Petersen, supra note 5, at 68.

33 See, e.g., Schlink, supra note 20, at 445–46.

34 Petersen, supra note 5, at 68, 86–87; Carrubba et al., supra note 7; Clifford J. Carrubba & Christopher Zorn, Executive Discretion, Judicial Decision Making, and Separation of Powers in the United States, 72 J. Pol. 812 (2010).

35 Christoph Schönberger, Karlsruhe: Notes on a Court, in The German Federal Constitutional Court, supra note 11, at 1, 8–10; Justin Collings, Democracy’s Guardians: A History of the German Federal Constitutional Court 1951–2001, at 61–62 (2015).

36 Wolfram Cremer, The Basic Right to “Free Development of the Personality”, in Debates in German Public Law 57 (Hermann Pünder & Christian Waldhoff eds., 2014).

37 Matthias Jestaedt, The Karlsruhe Phenomenon: What Makes the Court What It Is, in The German Federal Constitutional Court, supra note 11, at 32, 66.

38 Lepsius, supra note 23, at 95–96.

39 Steiner et al., supra note 2, at 3.

40 Petersen, supra note 5, at 6–8.

41 Id. at 60.

42 Id. at 71–79.

43 Id. at 70.

44 Id. at 82–84, 91.

45 Id. at 158.

46 Id. at 70. There are further limitations in the case selection, such as the exclusion of decisions based on Article 3 of the Basic Law, the general right to equality. Id.

47 Andrej Lang, Proportionality Analysis by the German Federal Constitutional Court, in Proportionality in Action, supra note 2, at 22.

48 Id. at 24–25.

49 Bent Stohlmann, Kilian Lüders, Alexander Tischbirek, Luisa Wendel, Leonard Hoeft & Sophie Reule, Konsolidierung Statt Siegeszug, 63 Der Staat 217 (2024).

50 This dataset will be used to validate the measurement used in this Article. See infra Section D.III.

51 See generally Talya Steiner, Liat Netzer & Raanan Sulitzeanu-Kenan, Necessity or Balancing: The Protection of Rights under Different Proportionality Tests, 20 Int’l J. Const. L. 642 (2022).

52 See generally Frederike Zufall, Rampei Kimura & Linyu Peng, Towards a Simple Mathematical Model for the Legal Concept of Balancing of Interests, 31 A.I.L. 807 (2023).

53 Kennedy, supra note 12, at 189; Bomhoff, supra note 6, at 16.

54 Lang, supra note 47, at 22; Grimm, supra note 1; Stone Sweet & Mathews, supra note 2, at 1.

55 Bomhoff, supra note 6, at 17.

56 Bundesverfassungsgericht [BVerfG] [Federal Constitutional Court] Apr. 24, 1991, 84 Entscheidungen des Bundesverfassungsgerichts [BVerfGE] 133, para. 80 (Ger.).

57 Bundesverfassungsgericht [BVerfG] [Federal Constitutional Court] Nov. 17, 2009, 124 Entscheidungen des Bundesverfassungsgerichts [BVerfGE] 300, para. 52 (Ger.).

58 It is therefore important to distinguish between scaling and classification. See Justin Grimmer & Brandon Stewart, Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts, 21 Pol. Analysis 267, 267 (2013).

59 See generally Michael Laver, Kenneth Benoit & John Garry, Extracting Policy Positions from Political Texts Using Words as Data, 97 Am. Pol. Sci. Rev. 311 (2003); Jonathan B. Slapin & Sven-Oliver Proksch, A Scaling Model for Estimating Time-Series Party Positions from Texts, 52 Am. J. Pol. Sci. 705 (2008).

60 See generally Ludovic Rheault & Christopher Cochrane, Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora, 28 Pol. Analysis 112 (2020).

61 See generally Stefan Munnes, Corinna Harsch, Marcel Knobloch, Johannes S. Vogel, Lena Hipp & Erik Schilling, Examining Sentiment in Complex Texts: A Comparison of Different Computational Approaches, 5 Frontiers Big Data (2022); Douglas Rice & Christopher Zorn, Corpus-Based Dictionaries for Sentiment Analysis of Specialized Vocabularies, 9 Pol. Sci. Rsch. & Methods 20 (2021).

62 See generally Austin C. Kozlowski, Matt Taddy & James A. Evans, The Geometry of Culture: Analyzing the Meanings of Class Through Word Embeddings, 84 Am. Socio. Rev. 905 (2019); Jason J. Jones, Mohammad Ruhul Amin, Jessica Kim & Steven Skiena, Stereotypical Gender Associations in Language Have Decreased Over Time, 7 Socio. Sci. 1 (2020); Alina Arseniev-Koehler & Jacob G. Foster, Machine Learning as a Model for Cultural Learning: Teaching an Algorithm What It Means to Be Fat, 51 Socio. Methods & Rsch. 1484 (2022).

63 See Jens Frankenreiter & Michael A. Livermore, Computational Methods in Legal Analysis, 16 Ann. Rev. L. & Soc. Sci. 39, 39 (2020).

64 See generally Michael C. Evans, Wayne V. McIntosh, Jimmy Lin & Cynthia L. Cates, Recounting the Courts? Applying Automated Content Analysis to Enhance Empirical Legal Research, 4 J. Empirical Legal Stud. 1007 (2007); Benjamin E. Lauderdale & Tom S. Clark, Scaling Politically Meaningful Dimensions Using Texts and Votes, 58 Am. J. Pol. Sci. 754 (2014).

65 See generally Arthur Dyevre, The Promise and Pitfall of Automated Text-Scaling Techniques for the Analysis of Jurisprudential Change, 29 A.I.L. 239 (2021).

66 See generally Elliott Ash, Daniel L. Chen & Sergio Galletta, Measuring Judicial Sentiment: Methods and Application to U.S. Circuit Courts, 89 Economica 362 (2022).

67 Christian Arnold, Benjamin Engst & Thomas Gschwend, Scaling Court Decisions with Citation Networks, 11 J.L. & Cts. 25, 27 (2021).

68 Christoph Möllers & Luisa Wendel, Korpus der Entscheidungen des Bundesverfassungsgerichts – Dataset (Version 2.0), Zenodo (Dec. 15, 2023), https://zenodo.org/records/10369205.

69 The spaCy “de_core_news_lg” model was used. See https://spacy.io/models/de#de_core_news_lg (last visited Feb. 16, 2025).

70 Luisa Wendel, Metadaten zu Entscheidungen des Bundesverfassungsgerichts – Dataset (Version 2.6.1), Zenodo (Dec. 14, 2023), https://zenodo.org/records/10378324.

71 The resulting variable was checked against the case selection of Petersen. See Petersen, supra note 5, at 197–206.

72 See generally Peter Turney & Patrick Pantel, From Frequency to Meaning: Vector Space Models of Semantics, 37 J.A.I. Rsch. 141 (2010). For another application of this technique with legal data, see Ryan Whalen, Alina Lungeanu, Leslie DeChurch & Noshir Contractor, Patent Similarity Data and Innovation Metrics, 17 J. Empirical Legal Stud. 615 (2020).

73 Alessandro Lenci, Distributional Models of Word Meaning, 4 Ann. Rev. Linguistics 151, 161–62 (2018).

74 For probably the best-known example of the power of word embeddings, see Tomas Mikolov, Kai Chen, Greg Corrado & Jeffrey Dean, Efficient Estimation of Word Representations in Vector Space, arXiv (Jan. 16, 2013), https://arxiv.org/abs/1301.3781. Mikolov et al. demonstrated that one can do calculations with these vectors. It was found that the vector for king minus the vector for man plus the vector for woman gives a vector that is almost equal to the vector for queen. Id.
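The vector arithmetic described in this note can be illustrated with a minimal sketch. The three-dimensional vectors below are hand-made toy values chosen for illustration (an assumption); they are not taken from Mikolov et al. or from the model trained for this Article, and the helper names are hypothetical.

```python
import math

# Hand-made toy vectors (assumption, not real embeddings).
# The three dimensions loosely stand for (royalty, male, female).
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.8, 1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


# king - man + woman lands on the toy vector for "queen".
result = analogy("king", "man", "woman", toy_vectors)
```

With real embeddings the resulting vector is only approximately equal to the target word's vector, which is why the nearest neighbor by cosine similarity is reported rather than an exact match.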

75 Rheault & Cochrane, supra note 60, at 112.

76 Kozlowski et al., supra note 62, at 911–14.

77 Rice & Zorn, supra note 61, at 23.

78 Mikolov et al., supra note 74, at 4–5.

79 See generally Radim Řehůřek & Petr Sojka, Software Framework for Topic Modelling with Large Corpora, NLP Centre (2010). For more technical information on word embeddings, see generally Pedro Rodriguez & Arthur Spirling, Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research, 84 J. Pol. 101 (2022). Configurations: 200 epochs, 200-dimensional vectors, window size of 12 words. The model contains 55,906 words and is publicly available. See Kilian Lüders, BVerfG - Word Embedding (Version 1.0), Zenodo (Apr. 2, 2024), https://doi.org/10.5281/zenodo.10908253.

80 Heun, supra note 15, at 208.

81 This result is unremarkable in itself, but the fact that the model associates different grammatical forms of the same word with one another indicates that the training worked as intended.

82 Lurie, supra note 27, at 189–90.

83 Bomhoff, supra note 6, at 9, 19.

84 The terms are: Desto (the more), abwägen (balancing), angemessenen (appropriate), größeres (greater), schwere (heavy), stärker (stronger), größer (greater), Ausgleich (balance), Gewicht (weight), abgewogen (weighed), überwiegen (outweigh), widerstreitenden (conflicting), verhältnismäßig (proportionate), Konkordanz (concordance), and gegeneinander (against). This extended seed word list was created on the basis of examples from the case law.

85 See Rice & Zorn, supra note 61, at 23–24; Munnes et al., supra note 61, at 8. Dictionaries are lists of words whose occurrences are counted in the text. See generally Justin Grimmer, Margaret Roberts & Brandon Stewart, Text as Data: A New Framework for Machine Learning and the Social Sciences 180 (2022).
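As a minimal sketch of the dictionary approach described in this note, the following counts how often words from a list occur in a text. The tokenizer, the lower-casing, the share-per-token normalization, and the example sentence are simplifying assumptions made for illustration; the word list is only a small excerpt of the seed words named in this Article.

```python
import re
import unicodedata


def normalize(text):
    """Lower-case and NFC-normalize so that umlauts compare reliably."""
    return unicodedata.normalize("NFC", text).lower()


# Excerpt of the seed words named in this Article (illustrative subset).
SEED_WORDS = {
    normalize(w)
    for w in ["Abwägung", "abwägen", "Gewicht", "überwiegen", "verhältnismäßig"]
}


def tokenize(text):
    """Split normalized text into word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", normalize(text))


def dictionary_score(text, dictionary=SEED_WORDS):
    """Return (hits, share): dictionary hits and hits per token."""
    tokens = tokenize(text)
    hits = sum(1 for token in tokens if token in dictionary)
    share = hits / len(tokens) if tokens else 0.0
    return hits, share


# Made-up example sentence, not a quotation from the case law.
example = "Bei der Abwägung kommt dem Eingriff ein besonderes Gewicht zu."
hits, share = dictionary_score(example)
```

Because only exact matches are counted, inflected forms that are missing from the list go undetected unless the text is lemmatized first.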

86 Justin Garten, Joe Hoover, Kate M. Johnson, Reihane Boghrati, Carol Iskiwitch & Morteza Dehghani, Dictionaries and Distributions: Combining Expert Knowledge and Large Scale Textual Data Content Analysis, 50 Behav. Res. 344, 347 (2018).

87 Sanjeev Arora, Yingyu Liang & Tengyu Ma, A Simple but Tough-to-Beat Baseline for Sentence Embeddings, OpenReview.net (Feb. 6, 2017), https://openreview.net/forum?id=SyK00v5xx.

88 Grimmer et al., supra note 85, at 181; Wouter van Atteveldt, Mariken A. C. G. van der Velden & Mark Boukes, The Validity of Sentiment Analysis: Comparing Manual Annotation, Crowd-Coding, Dictionary Approaches, and Machine Learning Algorithms, 15 Commc’n Methods & Measures 121, 124 (2021); Christian Rauh, Validating a Sentiment Dictionary for German Political Language—A Workbench Note, 15 J. Info. Tech. & Pol. 319, 321 (2018).

89 The performance of established dictionaries on more complex German texts is already mixed. See Munnes et al., supra note 61, at 13–15. Furthermore, it is questionable whether assumptions made for text scaling in political texts or for sentiment analysis can be readily transferred to court decisions. For instance, such methods assume that texts contain clearly distinguishable features and patterns of word usage on which statistical models can optimize. See Kohei Watanabe, Latent Semantic Scaling: A Semisupervised Text Analysis Technique for New Domains and Languages, 15 Commc’n Methods & Measures 81, 82 (2021).

90 Petersen, supra note 5, at 197–206.

91 Kilian Lüders, Luisa Wendel, Sophie Reule, Bent Stohlmann, Leonard Hoeft & Alexander Tischbirek, Verhältnismäßigkeit - Proportionality. An annotated dataset of GFCC decisions – Dataset (Version 1.0), Zenodo (Jan. 15, 2024), https://doi.org/10.5281/zenodo.10513684.

92 Dyevre, supra note 65, at 259.

93 Lori Young & Stuart Soroka, Affective News: The Automated Coding of Sentiment in Political Texts, 29 Pol. Comm. 205, 215 (2012).

94 Rauh, supra note 88, at 325–27.

95 See id. at 327 (presenting Sentiment Analysis); Young & Soroka, supra note 93, at 215.

96 Pearson correlations against the additive index: (1) Dictionary approach on original seed words r=0.29, (2) Dictionary approach on extended seed words r=0.35, (3) Paragraph embeddings on original seed words r=0.42, and (4) Paragraph embeddings on extended seed words r=0.49.

97 Munnes et al., supra note 61, at 12; Dyevre, supra note 65, at 260.

98 Petersen, supra note 5, at 197–206.

99 For more information on data and validation, see the Appendix linked at the end of this Article.

100 Petersen, supra note 5, at 130–33.

101 Stohlmann et al., supra note 49, at 230.

102 Petersen, supra note 5, at 130–31.

103 The mean was subtracted and then divided by the standard deviation.
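The standardization described in this note can be sketched as follows. The input values are made up for illustration, and the use of the population standard deviation rather than the sample standard deviation is an assumption, since the note does not specify which was used.

```python
import statistics


def standardize(values):
    """Subtract the mean, then divide by the standard deviation (z-scores)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD (assumption)
    return [(v - mean) / sd for v in values]


# Made-up raw scores, not taken from the Article's data.
raw_scores = [0.12, 0.30, 0.18, 0.40]
z_scores = standardize(raw_scores)
```

After this transformation the scores have a mean of zero and a standard deviation of one, so different measures can be compared on a common scale.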

104 A regression model is the reasonable next step in terms of methodological complexity. Of course, such simple regression models cannot be used to postulate causal relationships. They only serve to better describe the relationships between independent and dependent variables. For a discussion of this point and of the next steps for empirical analysis, see Section E.II. of this Article.

105 Schönberger, supra note 35, at 5.

106 I note that the log transformation has a rather strong effect on the estimated coefficient, suggesting that outliers play a crucial role in the analysis.

107 The coefficients vary considerably depending on which variables are included and whether they are logged.

108 In many alternative model configurations, for example when controlling for year and senate, or year and right-based review, or simply text length, there is no significant effect for the variable overturning laws.

109 Petersen, supra note 5, at 87.

110 Id. at 82.

111 Stohlmann et al., supra note 49, at 247.

112 Id. at 238.

113 Bomhoff, supra note 6, at 1.

114 Petersen, supra note 5, at 158.

115 Id. at 175.

116 Alexander Tischbirek, Die Verhältnismäßigkeitsprüfung: Methodenmigration zwischen öffentlichem Recht und Privatrecht 216 (2017); Thorsten Kingreen & Ralf Poscher, Grundrechte Staatsrecht II: Lehrbuch & Entscheidungen 426–27 (39th ed. 2023); Poscher, supra note 25, at 187–88.

117 Petersen, supra note 5, at 67.

118 See generally Gregory A. Caldeira, Neither the Purse Nor the Sword: Dynamics of Public Confidence in the Supreme Court, 80 Am. Pol. Sci. Rev. 1209 (1986).

119 See generally Bundesverfassungsgericht [BVerfG] [Federal Constitutional Court] May 5, 2020, 154 Entscheidungen des Bundesverfassungsgerichts [BVerfGE] 17 [hereinafter Judgment of May 5, 2020] (Ger.).

120 Judgment of May 5, 2020 at 116, 167–78.

121 Franz C. Mayer, To Boldly Go Where No Court Has Gone Before. The German Federal Constitutional Court’s ultra vires Decision of May 5, 2020, 21 German L.J. 1116, 1123–24 (2020).

122 See, e.g., 21(5) German L.J. (2020) (presenting a collection of articles concerning the PSPP debate).

123 Collings, supra note 35, at 295–97.

124 See, e.g., Bundesverfassungsgericht [BVerfG] [Federal Constitutional Court] Feb. 9, 2010, 125 Entscheidungen des Bundesverfassungsgerichts [BVerfGE] 175 (the Hartz IV unemployment benefits case); Bundesverfassungsgericht [BVerfG] [Federal Constitutional Court] Oct. 19, 2022, 163 Entscheidungen des Bundesverfassungsgerichts [BVerfGE] 254 (the Asylbewerberleistungsgesetz [Asylum Seekers Benefits Act] case).

125 Möllers, supra note 11, at 182–83; Lepsius, supra note 23, at 99–100, 110–11.

126 Florian Meinel, The Merkel Court: Judicial Populism Since the Lisbon Treaty, 19 Eur. Const. L. Rev. 111, 112 (2023).

127 Id. at 117.

128 Id. at 130.

129 See generally Barry Friedman, Taking Law Seriously, 4 Persps. on Pol. 261 (2006).

130 Elliott Ash, Judge Language Choices, 27 The Oxford Handbook of Comparative Judicial Behaviour 591 (Lee Epstein, Gunnar Grendstad, Urška Šadl & Keren Weinshall eds., 1st ed. 2024).

131 See generally Empirical Constitutional Scholarship: ICON: Debate!, 19 Int’l J. Const. L. 1810 (2021).

Figure 1. Conceptual plot of different word vectors. A new vector, an abstract representation of balancing, is created from four seed word vectors: Intensiv (intensive), desto (the more), Verhältnismäßigkeitsprinzip (principle of proportionality), and Abwägung (balancing).
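The idea behind Figure 1 can be sketched in a few lines. The caption does not state how the seed word vectors are combined or how a text is scored against the result; averaging the seed vectors and scoring by cosine similarity are assumptions made for illustration, and all numbers are made-up three-dimensional toy values rather than entries from the trained model.

```python
import math

# Made-up toy vectors standing in for the four seed words (assumption).
seed_vectors = {
    "intensiv": [0.9, 0.1, 0.0],
    "desto": [0.7, 0.3, 0.0],
    "Verhältnismäßigkeitsprinzip": [0.8, 0.0, 0.2],
    "Abwägung": [1.0, 0.0, 0.2],
}


def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Abstract "balancing" vector built from the seed word vectors.
balancing_vector = mean_vector(list(seed_vectors.values()))

# Two hypothetical paragraph vectors: one close to the seeds, one far away.
paragraph_a = [0.9, 0.1, 0.1]
paragraph_b = [0.0, 0.2, 0.9]
score_a = cosine(balancing_vector, paragraph_a)
score_b = cosine(balancing_vector, paragraph_b)
```

In this toy setup, the paragraph whose vector points in roughly the same direction as the seed words receives the higher score.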

Figure 2. Human annotation of balancing language (x-axis) against automated measures (mean with bootstrapped ninety-five percent Confidence Intervals). The measures of balancing language (y-axis) are each normalized (Mean = 0; SD = 1).

Figure 3. Regression estimates with ninety-five percent confidence intervals for a linear OLS model. Dependent variable: Balancing language (paragraph embeddings on extended seed words). Only predictors whose error bars do not cross the vertical zero line are significant.

Figure 4. Mean balancing language (paragraph embeddings on extended seed words) for each step of the proportionality test, with bootstrapped ninety-five percent confidence intervals.

Table 1. Descriptive statistics of the data used.

Table 2. Coefficients of OLS linear regression model.

Figure 5. Predicted values of balancing language over time for decisions overturning laws versus decisions not overturning laws, including ninety-five percent confidence intervals, adjusted for a right-based review proceeding of the first senate with median length.