A. Introduction
Throughout the twentieth century, it became an established practice for many constitutional courts to deploy balancing in constitutional rights adjudication. In particular, the German Federal Constitutional Court (“GFCC”) stands out as a pioneer of the proportionality test.Footnote 1 Given that several other constitutional courts have adopted the proportionality test, the GFCC is considered particularly influential.Footnote 2 While some authors enthusiastically describe this development as a global spread of fundamental rights protection,Footnote 3 others are more critical. They accuse balancing of being a technique of judicial activism.Footnote 4 According to this view, balancing allows constitutional courts too much leeway in justifying their decisions, leading to judicial self-empowerment. As a result, these authors argue, constitutional courts resort to balancing when they want to assert themselves against the legislature.
It is important to note that balancing is not a typical method employed by courts in their decisions. Judicial reasoning is usually categorical and subsuming under abstract definitions—for example, stating whether a contract is valid or whether a planning application has been wrongly refused. It is therefore an exceptional form of reasoning for courts that is of interest here. While the use of balancing by courts has received a great deal of attention in the academic literature, there is currently very little empirical evidence on how courts utilize this tool.Footnote 5 What remains insufficiently substantiated in this context are the accusations that constitutional courts are increasingly using balancing as a tool for judicial activism. One particularly powerful form of judicial activism occurs when the court opposes the legislature by overturning a law. But is it empirically true that constitutional courts use balancing more often when they strike down legislation?
This Article seeks to address these questions by empirically examining the use of balancing language in the GFCC jurisprudence. Balancing language refers to the typical phrases and language used by the court when balancing.Footnote 6 To empirically measure balancing language, state-of-the-art text-as-data methods are used to automatically analyze 3,274 decisions. This involves creating a word embedding model that is used to scale balancing language. The use of this method for the study of judicial argumentation techniques is new, which is why a particularly detailed validation is carried out.
Thus, it can be shown empirically that the use of balancing language by the GFCC increased during the first fifty years of the Court’s existence. Since then, however, it has tended to decline. Also, the hypothesis that more balancing language is used in decisions in which the court strikes down legislation cannot be confirmed.
This empirical study contributes to the literature in several ways: It adds new empirical insights to the debate on the use of balancing in constitutional rights adjudication and the critique of judicial activism. At the same time, it also contributes to the study of judicial behavior by examining the strategic use of argumentative means by courts in their decisions. Finally, this Article also adds to the methodological text-as-data literature insofar as methods are applied to new text genres and research questions. This is one of the first attempts to use scaling methods to study argumentation techniques in case law.
The remainder of the Article is organized as follows: First, Section B introduces the debate on balancing and the hypotheses. Second, Section C presents the data and methodology for measuring balancing language. Third, Section D provides an extensive validation of the newly introduced measurement. Finally, Section E tests the hypotheses based on the newly created data.
B. Balancing
Empirical research on judicial behavior and separation of powers has made a significant contribution to explaining the strategic behavior of constitutional courts.Footnote 7 Even for the GFCC, a growing number of studies show how the court behaves as a strategic, and at times activist, actor.Footnote 8 Despite the variety of research approaches,Footnote 9 the empirical study of the judicial tools used by courts is still in its infancy.Footnote 10 There is a gap in the research on how certain techniques of constitutional reasoning are used by the courts.Footnote 11 One such argumentative technique that has received considerable attention is balancing. This Article attempts to empirically examine the strategic use of balancing as a specific constitutional argumentative technique that has been the subject of controversy in legal scholarship for some time. The conceptual framework of the study will be presented in three steps. First, the current debate and criticism of constitutional balancing will be summarized, and the hypotheses derived in Section B.I. Second, the empirical research on how the GFCC uses balancing will be discussed in Section B.II. Third, the object of the study will be specified by clarifying what is meant by balancing language in Section B.III.
I. Theory and Hypotheses
The debate on balancing in law unfolded over the course of the twentieth century,Footnote 12 and the age of constitutional balancing has been diagnosed for more than thirty years now.Footnote 13 The German Federal Constitutional Court, as the inventor of the proportionality test, receives particular attention in this context.Footnote 14 The proportionality test is a four-step review process originally introduced to control state interference with individual liberties: The legitimate aim, the suitability as well as the necessity of the chosen means and, finally, the appropriateness are examined, with the final step being a balancing process.Footnote 15 The proportionality test was adopted—albeit in modified form—by other constitutional courts, which was the starting point for a debate on the globalization of constitutional law.Footnote 16 Today, the principle of proportionality is often taken for granted,Footnote 17 as an indispensable tool in the review of constitutional rights or even as a basic element of the rule of law.Footnote 18
Despite its established status, proportionality has always been critically commented on, especially because of balancingFootnote 19 and especially in regard to the GFCC.Footnote 20 It is argued that balancing is not a rational legal argumentation technique because there can be no fixed legal standards.Footnote 21 Weighing is always necessary and should be done by the democratically legitimated legislature and not by the courts.Footnote 22 Balancing thus opens up certain freedoms for the courts and invites judicial activism. In this sense, the criticism of the GFCC is quite specifically that it uses balancing or the proportionality test to give itself more room for maneuver in decision-making vis-à-vis politics.Footnote 23
This critical view is not shared by everyone in the literature. Many see proportionality as just another legal principle of German public lawFootnote 24 that enables the mediation of individual rights and public interests.Footnote 25 It is argued that proportionality is also equipped with nuanced internal standards. Depending on the fundamental right in question and the characteristics of the case, a “rich casuistry” facilitates differentiated standards of review.Footnote 26 The argumentative approaches can differ considerably. For example, cases involving the principle of equality found in Art. 3(1) of the Basic Law can be reviewed as a mere check on arbitrariness, but can also lead to a detailed four-step review.Footnote 27 The proportionality test and balancing are therefore not generally viewed with suspicion in German public law discourse. They are often seen as an obvious choice that does not necessarily imply a strategic motive.
The debate on the use of balancing by the GFCC is thus quite controversial. Many aspects of the debate involve very technical legal issues and raise questions about how the court should argue. However, the position of the critics who assume a link between the judicial decision-making technique and judicial activism can be understood as an empirical claim that can be examined:Footnote 28 Do courts actually use certain argumentation techniques strategically when they want to realize power vis-à-vis the legislature? Is there more balancing in such confrontational decisions?
Empirical research can contribute to this debate by shedding light on the actual use of balancing. Of course, this can only be one aspect of the broader debate on judicial activism. Judicial activism can take many different forms. In addition to balancing, there may be other strategies for the court to be activist, and various forms of judicial discretion can be distinguished. This raises a whole range of potential empirical research questions. However, this Article will focus on balancing, as this is an issue of particular interest in the global debate, especially with regard to the GFCC.
It is important to note that balancing goes beyond the proportionality test. Balancing is not only found in structured schemes, such as the proportionality test but is also often used freely, in an unstructured form, by the Court.Footnote 29 Thus, all explicit mentions of balancing methods, as well as all more implicit, weighting formulations and uses of language, should be included in an examination of the strategic use of balancing as a tool of judicial activism.Footnote 30
When it comes to judicial activism, or more fundamentally measuring the Court’s decision-making, the empirical literature on the GFCC often focuses on the Court’s power to overturn legislation.Footnote 31 In these decisions, judicial activism is particularly evident because the court is ruling against a legislative body that has been legitimated by democratic elections. It is, therefore, a direct demonstration of power against the legislatureFootnote 32 to which the GFCC is legally entitled under § 31 of BVerfGG, the German Constitutional Court Act. This study builds on the empirical literature and considers decisions overturning laws and/or declaring them unconstitutional as an expression of judicial activism. This understanding of judicial activism, combined with the critique of balancing as a tool of judicial activism, leads to the first hypothesis:
H1: In decisions overturning laws, the Court uses more balancing language.
The next hypothesis concerns the development over time. Critics often accuse the court of increasingly using balancing.Footnote 33 Two different but compatible theories support this claim. The first theory is concerned with the influence of the institutional strength and position of the court in the political environment on its decision-making.Footnote 34 In the early Federal Republic, the GFCC was a new institution that first had to attain its position of power.Footnote 35 Accordingly, it is to be expected that in the majority of decisions, the Court initially adopted a more cautious approach in its reasoning, which also applies in particular to newer techniques such as the proportionality test.
The second line of reasoning is based on the practice of fundamental rights interpretation. As the GFCC has gradually established a very broad and far-reaching protection of fundamental rights, ultimately protecting any form of human action,Footnote 36 the need to deal with conflicts of fundamental rights has increased over time.Footnote 37 It makes sense to use balancing to resolve such conflicts, as none of the conflicting fundamental rights has to be restricted, but a case-specific compromise can be found.Footnote 38 The dynamics of fundamental rights jurisprudence thus also speak in favor of an increase in the use of balancing language. Both strands of argumentation on temporal dynamics lead to the second hypothesis:
H2: The Court increasingly uses balancing language over time.
Both hypotheses, thus, allow us to empirically test important assumptions of the critique of the GFCC and shed light on how certain legal argumentation techniques are used by the court.
II. Empirical Research
Although research on balancing and proportionality is extensive, empirical research is rather limited.Footnote 39 In the following, the main empirical contributions on the use of balancing by the GFCC are discussed.
The most important contribution is undoubtedly the comparative study by Petersen.Footnote 40 It shares with the present study the guiding question of whether constitutional courts use balancing to promote judicial activism.Footnote 41 Methodologically, an argumentation analysis is conducted for constitutional courts in Germany, Canada, and South Africa. For the analysis, Petersen has created a category system of argumentation types, which also includes the steps of the proportionality test.Footnote 42 The object of Petersen’s study on the GFCC is the set of decisions in which laws have been declared invalid.Footnote 43 For the GFCC, Petersen observes great variability over time and concludes that balancing is the dominant, but not the only, decisive argumentation technique.Footnote 44 However, at no time was it the case that the majority of cases examined were decided on the basis of balancing. Rather, the tendency of the Court to limit itself in balancing can be observed, which is why its use is far less problematic than critics claim.Footnote 45 Petersen’s study is easily one of the best empirical works on the GFCC. However, the selection of cases (n=241) leads to limitations. Petersen can only make statements on those cases in which laws are rejected.Footnote 46 It is, therefore, not possible to make an empirical statement on the basis of Petersen’s data as to whether there is more balancing in decisions striking down legislation than in others.
A second empirical study of the GFCC was published by Lang.Footnote 47 The work is mainly interested in the contexts in which the proportionality test is applied. Based on a pre-selection through a keyword search, 114 decisions between 2000 and 2017 were analyzed manually according to a system of categories.Footnote 48 Again, there are limitations in the case selection, which is restricted to decisions with typical keywords of the proportionality test.
In a third noteworthy study, again on the GFCC, Stohlmann and his colleagues randomly selected 300 decisions and annotated the proportionality test at the sentence level.Footnote 49 This resulted in very fine-grained data on the occurrence and internal structure of the test.Footnote 50 The Article aims to describe how the proportionality test and its application have evolved. However, this approach is not concerned with balancing as such, only as a part of the proportionality test.
Apart from research on the GFCC, it can currently be observed that more and more innovative research designs are devoted to balancing in court decisions. For example, Steiner and her colleaguesFootnote 51 use an experiment to investigate the relevance of balancing in the proportionality test. Zufall and her colleaguesFootnote 52 design a formal model to capture balancing between selected fundamental rights of the EU Charter of Fundamental Rights (“EUCh”).
The present study also relies on novel research designs by using methodological text-as-data innovations. They allow the automated analysis of the entire body of GFCC case law and, thus, comparisons between decisions rejecting and not rejecting legislation.
III. Balancing Language
It is often disputed what exactly is meant by balancing.Footnote 53 The literature on the GFCC, therefore, often focuses on the—explicit—proportionality test,Footnote 54 without taking into account that the Court also balances outside the proportionality test. In contrast, the object of this study, the balancing language of the GFCC, uses a conceivably broad understanding of balancing. It understands balancing as an argumentative technique that can be recognized by certain legal language, typical phrases and formulations.Footnote 55 What matters here is the language used by the court not the presence of a full proportionality test.
In some cases, the court also explicitly states that it is “balancing” or “weighing.” Thus, in the case law one can find formulations such as: “When weighing up the public interest which the provision serves against the severity of the encroachment, the provision is found to be (only just) reasonable in general.”Footnote 56 Such explicit references also include other key terms, such as the naming of the Verhältnismäßigkeitsprüfung (proportionality test) or its balancing step Angemessenheit (appropriateness).
However, it cannot be assumed that the GFCC always names and makes explicit its own methodological approach. Balancing or weighing is often carried out without being named as such. In this sense, the measurement also needs to find expressions of the execution of balancing. Such formulations could be as follows: “The more concretely and directly a legal interest is placed in danger by an expression of opinion, the less stringent are the requirements when it comes to an encroachment; the more indirect and distant the threatening violations of legal interests remain, the greater are the requirements to be made.”Footnote 57 Accordingly, the language of balancing also includes relational formulations, which can be identified by keywords such as umso, desto (the more), or größer (greater).
This operationalization of balancing, thus, targets both explicit naming and relational execution. At the same time, certain linguistic expressions and forms of argumentation, such as categorical or deductive ones, are excluded. The aim of the next section is to introduce a heuristic with which balancing language can be measured in all senate decisions of the GFCC. A new dataset is, therefore, created to test the hypotheses, whereby the evaluation of the entire case law allows comparisons between different groups of cases and is thus able to close the identified research gap.
C. Methods: Scaling Balancing Language
Scaling text means automatically locating the text on a meaningful scale.Footnote 58 Pioneers in the use of these techniques for empirical research include political science, which has developed procedures to position political texts on ideological left-right scales.Footnote 59 A comparatively new heuristic is the use of word embeddings for these purposes. Word embeddings are used not only in political science for political right-left scales,Footnote 60 but also for sentiments,Footnote 61 or in sociology to capture stereotyping.Footnote 62
Text-as-data methods are now increasingly used for legal textsFootnote 63 and tasks, such as locating decisions on political ideology scales.Footnote 64 Some work is already using word embeddings for more innovative decision scaling, such as Dyevre,Footnote 65 who locates GFCC decisions on a pro/anti-EU scale, or Ash et al.Footnote 66 to measure judicial sentiments. In general, however, it is quite difficult to scale judicial decisions, which is why questionable proxies are often used.Footnote 67
This Article aims to scale the intensity of the use of a legal reasoning technique. It is an attempt to investigate latent semantic scaling procedures for a new research object, legal argumentation. After introducing the data and the main independent variable in Section C.I., the functioning of the word embeddings will be explained and the model specifically created for the GFCC will be presented in Section C.II. Finally, the scaling procedure will be introduced in Section C.III.
I. Data
The project uses the GFCC full-text corpus.Footnote 68 The research questions of the Article focus on the decisions of the senates, excluding chamber decisions. Accordingly, all senate decisions of the Court between 1951 and 2021 that have been published in the official collection and contain substantial reasoning have been selected. However, ninety very atypical decisions were not taken into account. Decisions whose reasoning comprises fewer than forty or more than 10,000 tokens are excluded. On this basis, 3,274 decisions are analyzed. The hypotheses refer to the legal reasoning of the court, so the presentation of the facts, including the statements by the parties, is not taken into account for any of the variables used in this Article. As pre-processing, the text was segmented into tokens using spaCy, a software library for natural language processing, and stop words were removed.Footnote 69 The corpus comes with a meta-datasetFootnote 70 that provides information on the date of the decision, the senate, and the type of proceeding.
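The length filter and the stop word removal described above can be sketched as follows. This is only an illustration under stated assumptions: whitespace splitting stands in for the spaCy tokenizer, the stop word list is a hypothetical miniature, and the length filter is assumed to be applied before stop words are removed, which the text does not specify.

```python
# Hypothetical stand-in for the spaCy pipeline: whitespace tokenization
# and a tiny illustrative stop word list (not the list used in the Article).
STOP_WORDS = {"der", "die", "das", "und", "in", "ist"}

MIN_TOKENS = 40      # lower bound stated in the text
MAX_TOKENS = 10_000  # upper bound stated in the text


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; a real pipeline would use spaCy."""
    return text.lower().split()


def keep_decision(reasoning: str) -> bool:
    """Keep a decision only if its reasoning has between 40 and 10,000 tokens."""
    n_tokens = len(tokenize(reasoning))
    return MIN_TOKENS <= n_tokens <= MAX_TOKENS


def preprocess(reasoning: str) -> list[str]:
    """Tokenize the reasoning and drop stop words."""
    return [tok for tok in tokenize(reasoning) if tok not in STOP_WORDS]
```

Only the two thresholds are taken from the text; every function name here is illustrative.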
However, the independent variable that captures whether or not a law is overturned in a decision had to be created. The unconstitutionality of a law is determined by the GFCC in the introductory statement. In doing so, the court uses standardized language. It writes that a law is unvereinbar (incompatible) with the Basic Law and is, therefore, nichtig (void). Based on these typical expressions, regex rules were created to search for the relevant decision passages.Footnote 71 Accordingly, in 559 of the total 3,274 decisions, laws were at least partially overturned.
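A rule of this kind might look like the sketch below. The pattern is hypothetical: the Article's actual regex rules are given in its footnote and are not reproduced here, so this example only searches the operative part for the two signal words named in the text.

```python
import re

# Hypothetical pattern, not the Article's rule set: it matches the two
# standardized signal words as whole words, ignoring case.
OVERTURN_PATTERN = re.compile(r"\b(unvereinbar|nichtig)\b", re.IGNORECASE)


def law_overturned(operative_part: str) -> bool:
    """Return True if the operative part contains one of the signal words."""
    return OVERTURN_PATTERN.search(operative_part) is not None
```

The word boundaries ensure that vereinbar (compatible) does not trigger a match. A production rule set would presumably need further conditions, for example to handle negations.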
II. Word Embeddings
Measuring balancing language, the dependent variable, is more difficult because the language used by the court is more diverse. Word embeddings offer a solution to this problem. They are a technique that represents words as vectors, which means locating them in a high-dimensional vector space.Footnote 72 Each word is represented as a vector, with the geometric relationship of the vectors to each other representing semantic meaning.Footnote 73 For example, words with similar meanings are closer together in vector space. The information that can be incorporated into word embedding models can be quite rich, even allowing analogy inference.Footnote 74
For empirical social science, it is particularly relevant that word embeddings can be used to capture latent constructs, such as political ideology.Footnote 75 In practice, this is done on the basis of so-called seed words, terms that ideally represent the language use of interest. A new vector representing the abstract construct can then be calculated from the vectors of the seed words. This new vector can be used to measure latent dimensions of meaning, such as gender stereotypes, political ideologies or balancing language.Footnote 76 Figure 1 schematically illustrates how a vector is created from the vectors of the seed words intensiv (intensive), desto (the more), Verhältnismäßigkeitsprinzip (principle of proportionality), and Abwägung (balancing), which is supposed to represent abstract balancing language.

Figure 1. Conceptual plot of different word vectors. A new vector, an abstract representation of balancing, is created from four seed word vectors: Intensiv (intensive), desto (the more), Verhältnismäßigkeitsprinzip (principle of proportionality), and Abwägung (balancing).
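The construction shown in Figure 1 can be illustrated with a few lines of code. The sketch assumes that the balancing vector is the component-wise mean of the seed word vectors; the text only says that it is calculated from them, so the averaging is an assumption. The three-dimensional vectors are invented toy values, not entries of the GFCC model.

```python
def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Invented toy vectors for the four seed words (the real model has 200 dimensions).
toy_seed_vectors = {
    "intensiv": [0.2, 0.8, 0.0],
    "desto": [0.4, 0.6, 0.2],
    "verhältnismäßigkeitsprinzip": [0.6, 0.4, 0.0],
    "abwägung": [0.8, 0.2, 0.2],
}

# Abstract representation of balancing language under the averaging assumption.
balancing_vector = mean_vector(list(toy_seed_vectors.values()))
```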
The quality of this approach depends largely on the suitability of the word representations, that is, the language model used. As this project aims to scale legal texts, the word embeddings model must also be able to represent legal language correctly.Footnote 77 To this end, a custom model was created based solely on GFCC jurisprudence to represent as accurately as possible the linguistic practice of this particular court.
To create a word embedding model, a neural network is used, which is given the task of predicting the correct word from the context of its use.Footnote 78 To do this, arbitrary sentences are taken from the GFCC case law, in which individual words are masked. The neural network is then instructed to infer these words from the context and learn from its incorrect predictions. During this training, which is also repeated for all words, the model optimizes its predictions and obtains a 200-dimensional vector representation for each word. The GFCC model was trained using the implementation of the software library Gensim.Footnote 79
A simple face validity check can be used to verify that the training was successful. This is done by listing the most similar vectors of a key term and checking whether there is also a semantic relationship in the resulting terms. Similarity in this context means the geometric cosine similarity of two word vectors.
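As a minimal sketch of this check, cosine similarity and a nearest-neighbor lookup can be written in plain Python. The two-dimensional vectors are invented for illustration, and the function names are hypothetical rather than taken from the Article's code.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(term: str, embeddings: dict[str, list[float]], topn: int = 3):
    """Rank all other terms by cosine similarity to the given term."""
    target = embeddings[term]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in embeddings.items()
        if other != term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented toy vectors; a trained model would supply 200-dimensional ones.
toy_vectors = {
    "willkürverbot": [1.0, 0.20],
    "willkürverbots": [1.0, 0.25],
    "gleichheitssatz": [0.8, 0.50],
    "eigentum": [0.0, 1.00],
}
neighbors = most_similar("willkürverbot", toy_vectors)
```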
In order to test the model training, a term that has a specific technical meaning in the constitutional rights adjudication of the GFCC, namely Willkürverbot (interdiction of arbitrariness),Footnote 80 is used to see if the most similar words provided by the newly trained model make sense. As the most similar vector, the model first outputs the same word in a different grammatical case (Willkürverbots).Footnote 81 Next come the terms Verhältnismäßigkeitserfordernissen (proportionality requirement) and Gleichheitssatz (principle of equality). The language model thereby shows that it grasps genuinely doctrinal relationships from the case law, because the court examines the general principle of equality enshrined in Art. 3(1) of the Basic Law with both the proportionality test and the arbitrariness test.Footnote 82 The next most similar words, Willkür (arbitrariness) and Verstoß (infringement), are also plausible in context and suggest that the training was successful. Based on the new GFCC word embedding model, the next step is to introduce the methodology for implementing the scaling.
III. Measures
For the implementation, two seed word lists and two methods for scaling text passages are proposed in the following, resulting in four different measurements. Only after the subsequent validation in Section D.I. will it be decided which measurement performs best and is used to test the hypotheses. Two independent seed word lists were created using two approaches. The first is a minimalist list consisting of only four seed words: Intensiv (intensive), desto (the more), Verhältnismäßigkeitsprinzip (principle of proportionality), and Abwägung (balancing). These four words were taken from the example sentences used to introduce the concept of balancing language in Section B.III. This approach thereby limits the task of selecting seed words to a minimum and relies heavily on the generalization through the embedding model.
For the second list, an extended list of seed words was created. Because the selection requires excellent domain knowledge, the seed words were chosen in discussions with two professors of German public law. Both can be considered experts in GFCC jurisprudence, as both have been involved in proceedings before the court. In order to select the words, the professors were first asked to compile as many terms as possible that they felt represented balancing in the jurisprudence. This list was supplemented by word suggestions from Bomhoff.Footnote 83 With this initial list of words, random sentences were drawn from the case law and then examined to see if the court had actually balanced. False terms were removed. The final extended seed word list contains fifteen terms.Footnote 84 The seed word lists are each used to create balancing vectors, which abstractly represent balancing language.
Two methods are used to scale the decision texts. In both cases, the scaling is done at the paragraph level. Paragraphs represent closed argument units structured by the court. The first method uses the word embedding model to obtain an extended and improved dictionary.Footnote 85 This is done by taking the sixty most similar terms for the newly created balancing vectors. The Appendix contains the two dictionaries, which also give an initial face validity that the procedure can indeed capture balancing language. The next step is to count the number of occurrences of the words in the dictionary for each paragraph. Thus, the method used is a dictionary approach, whereby the word embedding model improves the word selection, that is, reduces the selection bias.
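The dictionary approach can be sketched in two steps: expanding the balancing vector into a dictionary of its nearest terms, and counting dictionary hits per paragraph. The vectors and the cut-off of two terms below are toy values (the Article uses the sixty most similar terms), and the function names are hypothetical.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def build_dictionary(balancing_vector, embeddings, topn=60) -> set[str]:
    """Take the topn terms whose vectors are most similar to the balancing vector."""
    ranked = sorted(
        embeddings,
        key=lambda term: cosine_similarity(balancing_vector, embeddings[term]),
        reverse=True,
    )
    return set(ranked[:topn])


def dictionary_score(paragraph_tokens: list[str], dictionary: set[str]) -> int:
    """Count how many tokens of the paragraph occur in the dictionary."""
    return sum(1 for tok in paragraph_tokens if tok in dictionary)


# Invented toy model with three terms and a toy balancing vector.
toy_embeddings = {"abwägung": [1.0, 0.1], "gewichtung": [0.9, 0.2], "frist": [0.0, 1.0]}
toy_dictionary = build_dictionary([1.0, 0.0], toy_embeddings, topn=2)
score = dictionary_score(["abwägung", "frist", "gewichtung", "abwägung"], toy_dictionary)
```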
The second method uses word embeddings to also represent the paragraphs as vectors.Footnote 86 A vector is computed for each paragraph. Each paragraph vector is a mean of the vectors of the occurring words multiplied by a weighting factor for each word.Footnote 87 The weights are inversely proportional to the frequency of each word in the corpus, so that rare words have more impact on the averaged vector. The result is a vector representation for each paragraph. Finally, for each of these paragraph vectors, the cosine similarity to the two balancing vectors, one vector for each seed word list, is calculated. This method can be called a paragraph embedding approach.
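A minimal sketch of the paragraph embedding approach follows. The exact weighting scheme is given in the Article's footnote and is not reproduced here; the sketch assumes the simplest scheme consistent with the description, a weight of one over the corpus count of each word. All vectors are invented toy values.

```python
import math
from collections import Counter


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def inverse_frequency_weights(corpus: list[list[str]]) -> dict[str, float]:
    """Assumed weighting: 1 / corpus count, so rare words weigh more."""
    counts = Counter(tok for paragraph in corpus for tok in paragraph)
    return {tok: 1.0 / n for tok, n in counts.items()}


def paragraph_vector(tokens, embeddings, weights):
    """Mean of the weighted word vectors; None if no token is known."""
    known = [t for t in tokens if t in embeddings and t in weights]
    if not known:
        return None
    dim = len(embeddings[known[0]])
    total = [0.0] * dim
    for tok in known:
        for i, x in enumerate(embeddings[tok]):
            total[i] += weights[tok] * x
    return [x / len(known) for x in total]


# Toy corpus of two paragraphs and invented two-dimensional vectors.
toy_corpus = [["abwägung", "gesetz"], ["gesetz", "frist"]]
toy_embeddings = {"abwägung": [1.0, 0.0], "gesetz": [0.0, 1.0], "frist": [0.0, 1.0]}
toy_balancing_vector = [1.0, 0.0]

weights = inverse_frequency_weights(toy_corpus)
scores = [
    cosine_similarity(paragraph_vector(p, toy_embeddings, weights), toy_balancing_vector)
    for p in toy_corpus
]
```

In this toy setting the first paragraph, which contains abwägung, receives the higher score.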
Thus, four different measures of balancing language are made for all paragraphs: (1) Dictionary approach on original seed words, (2) dictionary approach on extended seed words, (3) paragraph embeddings on original seed words, and (4) paragraph embeddings on extended seed words. Finally, all measures were normalized by subtracting the mean and dividing by the standard deviation.
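The normalization step is a standard z-transformation and can be sketched as follows. Whether the population or the sample standard deviation was used is not stated in the text; the sketch assumes the population version.

```python
from statistics import mean, pstdev


def standardize(scores: list[float]) -> list[float]:
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population SD; the text does not specify
    return [(s - mu) / sigma for s in scores]


# Invented raw scores for four paragraphs.
z_scores = standardize([1.0, 2.0, 3.0, 4.0])
```

After this step each measure has a mean of zero and a standard deviation of one, which makes the four measures comparable on a common scale.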
All four measures are supposed to measure balancing language in the GFCC case law. The scaling is intended to represent the intensity of balancing language in the given text passage. The following section examines whether the measure is sufficiently robust and which of the four measurement methods provides the best results.
D. Validation
Validation of the measures is crucial, as the performance of text-based methods is highly dependent on the area of use.Footnote 88 No in-depth experience has been gained with texts from German court decisions.Footnote 89 Therefore, it is particularly important to ensure that the measures actually measure what they are supposed to measure. In the following, the validation will be performed on three datasets in order to demonstrate the suitability of the methods used. The basis for the first validation is a newly created dataset in which three human annotators coded balancing in 1,500 paragraphs, as discussed in Section D.I. This dataset has been created for this Article and will serve as a basis for deciding which of the measures introduced above works best. Subsequently, two additional external data sources are used to ensure the validity of the measurement. On the one hand, data from PetersenFootnote 90 are used, who hand-coded the argument type of 240 decisions, discussed in Section D.II. On the other hand, data from Lüders et al.Footnote 91 is used, who annotated the proportionality analysis at sentence level for 300 decisions of the GFCC, as covered in Section D.III.
I. Balancing Language
For the first validation, new annotation data were created. Three students each hand-coded 1,500 paragraphs. All three were studying law at German universities and were at the end of their studies for the first state examination. The 1,500 paragraphs were randomly selected from the case law and each coder was asked to say for each paragraph whether balancing was applied or not. A short codebook with guidelines and examples was provided for this purpose.
There was agreement between the three coders on this task in seventy-eight percent of cases. In 1,142 cases there was agreement that there was no balancing. However, it is striking that one annotator recognized balancing in 326 cases, while the other two only recognized it in seventy-seven and seventy-three cases respectively. In only thirty-two cases did all three agree that there was balancing. In a further fifty-four cases at least two coders agreed, whereas in 272 cases, only one coder annotated balancing. Thus, agreement between annotators is not particularly strong.
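The reported figures can be cross-checked with simple arithmetic. The sketch below uses only the counts stated in the paragraph above; the variable names are illustrative.

```python
# Number of paragraphs by how many of the three coders saw balancing,
# as reported in the text.
paragraphs_by_votes = {0: 1142, 1: 272, 2: 54, 3: 32}

total_paragraphs = sum(paragraphs_by_votes.values())

# Full agreement means all three coders said "no" or all three said "yes".
full_agreement = paragraphs_by_votes[0] + paragraphs_by_votes[3]
agreement_share = full_agreement / total_paragraphs

# Total number of positive codings implied by the vote distribution ...
positive_codings = sum(votes * n for votes, n in paragraphs_by_votes.items())

# ... which should equal the sum of the per-coder totals reported in the text.
coder_totals = [326, 77, 73]
```

The counts are internally consistent: they sum to 1,500 paragraphs, full agreement holds in 1,174 of them (about seventy-eight percent), and both ways of counting yield 476 positive codings.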
There may be several reasons why the outcome of the task is so inconsistent. Of course, human annotations are always influenced by errors to some degree. However, the domain of the task certainly plays a particular role. The annotators stated afterwards that they found it difficult to read only isolated paragraphs and would have liked to know more context for the decision. It should also be remembered that it is not part of legal training to analyze in depth the type of arguments deployed by constitutional courts and that other studies using lawyers for coding also find high variance in their results.Footnote 92
Nevertheless, these data can be used to validate the measure.Footnote 93 From a methodological point of view, it is crucial that they are not understood as a gold standard,Footnote 94 but as an expression of a legal routine, for which it is obviously not clear in many cases whether balancing takes place or not. Against this background, an additive index is created from the three annotations, ranging from zero to three. The zero here expresses that no balancing language occurs, while three represents the cases in which balancing clearly occurs. The two values in between represent intermediate levels that display the intensity of the balancing language. Human disagreement in coding is, thus, understood as reflecting the strength of the linguistic expression of balancing language in the text.Footnote 95 Accordingly, it is crucial for the validation of the automated measurement that it captures this scale.
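The construction of the additive index can be illustrated as follows. This is a minimal sketch under the assumption that each annotation is stored as a binary code (1 for balancing, 0 for no balancing); the function name and the example annotations are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def additive_index(codes):
    """Sum three binary annotations (1 = balancing, 0 = none) into a 0-3 index."""
    if len(codes) != 3 or any(c not in (0, 1) for c in codes):
        raise ValueError("expected exactly three binary codes")
    return sum(codes)


# Hypothetical annotations for five paragraphs, one tuple per paragraph.
annotations = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1), (0, 0, 1)]

index_values = [additive_index(codes) for codes in annotations]
distribution = Counter(index_values)  # number of paragraphs per index level

print(index_values)
print(dict(distribution))
```

The index deliberately ignores which coder marked a paragraph; only the number of positive codes enters the scale.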
Figure 2 shows the average of the four measures of balancing language—including bootstrapped ninety-five percent Confidence Intervals—against the additive index of human annotation. All four measures are generally correct in their tendency to represent the additive index. The measures based on paragraph embeddings slightly outperform the measures based on dictionary approaches.Footnote 96 This is because the latter struggle with the first stage of the index and have much larger confidence intervals. Among the seed word lists, the extended list performs better, although it is quite surprising how well the original list performs, given that it consists of only four chosen seed words. Overall, the correlations are not particularly strong, which means that there is a lot of variance. However, the correlation of the best measure, paragraph embeddings on extended seed words, is close to r=0.5, which is sufficient given the difficulties the human annotators had and the typical scores for such validations otherwise.Footnote 97

Figure 2. Human annotation of balancing language (x-axis) against automated measures (mean with bootstrapped ninety-five percent Confidence Intervals). The measures of balancing language (y-axis) are each normalized (Mean = 0; SD = 1).
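The kind of summary shown in Figure 2 can be sketched as follows: scores are normalized, grouped by the additive index, and each group mean is given a percentile bootstrap interval. This is an illustrative sketch only. The data are synthetic, the function names are hypothetical, and the use of the percentile method and the population standard deviation are assumptions, as the text does not specify these details.

```python
import random
import statistics


def zscore(values):
    """Normalize to mean 0 and SD 1 (population SD; an assumption)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lower = means[round((alpha / 2) * n_boot)]
    upper = means[round((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(values), lower, upper


# Synthetic toy data: index levels 0-3 with scores that rise with the level.
rng = random.Random(42)
levels = [0] * 200 + [1] * 60 + [2] * 25 + [3] * 15
raw_scores = [rng.gauss(0.3 * level, 1.0) for level in levels]
scores = zscore(raw_scores)

summary = {}
for level in sorted(set(levels)):
    group = [s for s, lvl in zip(scores, levels) if lvl == level]
    summary[level] = bootstrap_mean_ci(group)

for level, (mean, lower, upper) in summary.items():
    print(level, round(mean, 2), round(lower, 2), round(upper, 2))
```

Smaller groups produce wider intervals, which is why the upper index levels, where few paragraphs fall, are estimated less precisely.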
The first step has thus shown that the measurement of balancing language can map human intuitions, although these turn out to be not very consistent. The measure based on the paragraph embeddings on the extended seed word list turns out to be the most appropriate, so only this measure will be used in the remainder of the Article.
II. Decisive Arguments
The second validation uses data from Petersen.Footnote 98 Petersen conducted an argumentation analysis at the decision level, for which a system of ten argumentation categories was developed. The aim was to capture which argumentation techniques were crucial to the GFCC’s decision, with the possibility of more than one category being assigned to a decision. One category in the annotation scheme is balancing. Thus, for validation purposes, it is expected that more balancing language will be used in decisions where balancing has been annotated by Petersen.
A linear ordinary least squares regression was computed for validation.Footnote 99 The dependent variable is the balancing language—paragraph embeddings on extended seed words—with averages calculated for each decision. The independent variables are the ten dichotomous argumentation categories. Figure 3 visualizes the estimates. The balancing argumentation category has a strongly positive estimate. In contrast, as expected, there are no significant or strongly positive estimates for the categorical arguments, as well as for consistency, coherence, and deductive reasoning. The estimate for the rational connection and less restrictive means test category, the techniques corresponding to the second and third steps of the proportionality test, is surprisingly strong. However, Petersen also claims to have found a lot of implicit balancing within the less-restrictive-means test.Footnote 100

Figure 3. Regression estimates with ninety-five percent Confidence Intervals for a linear OLS model. Dependent variable: Balancing language (paragraph embeddings on extended seed words). Only those predictors whose error bars do not cross the vertical zero line are significant.
III. Proportionality
For the third validation, a dataset is used in which proportionality tests have been annotated in 300 decisions of the GFCC.Footnote 101 Each sentence of the decisions was annotated based on whether it could be assigned to one of the steps of the proportionality test. In addition, there is a category for passages that obviously belong to a proportionality test but cannot be assigned to any step or can be assigned to several steps. In forty-nine of the 300 decisions, at least one of the proportionality test steps was annotated. To validate the measurement of balancing language, these forty-nine decisions are checked to see which passages use more balancing language. Obviously, the balancing step is expected to measure the strongest balancing language.
Figure 4 shows the mean scores, with ninety-five percent bootstrapped Confidence Intervals, for balancing language for each of the steps. The highest mean score is for the category representing text passages that are part of a proportionality test but were either not assigned to any step or were assigned to multiple steps. The high score is not surprising, as these passages are short, mostly found as introductions to tests, and use several key terms. Among the steps of the proportionality test, balancing clearly has the highest mean score. Interestingly, for the necessity step, the part of the test in which a less restrictive means test is performed, a significantly higher mean score was found. This again confirms Petersen’s observation on implicit balancing in the less restrictive means test.Footnote 102

Figure 4. Mean balancing language—paragraph embeddings on extended seed words—for the proportionality test steps—mean with bootstrapped ninety-five percent Confidence Intervals.
Thus, the performance of the balancing language measure was validated with three data sources. All three validation strategies showed that the scaling of balancing language behaved as theoretically expected. Although the identification of balancing language proved difficult even for human annotators, the validation performed here is sufficient reason to assume that the measure captures the relevant trends in GFCC jurisprudence.
E. Results: The Usage of Balancing Language
Having obtained a robust measure of balancing language for the GFCC jurisprudence, the Article now turns to the hypotheses introduced earlier. To this end, Section E.I first provides an overview of the variables and the analysis, followed by a discussion of the results in Section E.II.
I. Model
The variables used are summarized in Table 1. Balancing language is the dependent variable. The measurement uses the paragraph embedding approach on the extended seed word list. Each paragraph of the reasoning section was scaled individually and then averaged for each decision (n=3,274). As before, the measured values were normalized.Footnote 103 The independent variables relevant to the hypothesis are, on the one hand, the year of the decision and, on the other hand, the dichotomous variable of whether a law is overturned in the decision.
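The aggregation described above, from paragraph-level scores to one normalized value per decision, can be sketched as follows. This is an illustration under stated assumptions: the function name, the decision identifiers, and the scores are hypothetical, and the normalization is assumed to be a z-standardization across decisions using the population standard deviation.

```python
import statistics
from collections import defaultdict


def decision_scores(paragraph_rows):
    """Average paragraph scores per decision, then z-standardize across decisions."""
    by_decision = defaultdict(list)
    for decision_id, score in paragraph_rows:
        by_decision[decision_id].append(score)

    # One mean per decision, regardless of how many paragraphs it has.
    means = {d: statistics.fmean(v) for d, v in by_decision.items()}

    grand_mean = statistics.fmean(list(means.values()))
    sd = statistics.pstdev(list(means.values()))
    return {d: (m - grand_mean) / sd for d, m in means.items()}


# Hypothetical paragraph-level scores as (decision id, score) pairs.
rows = [("A", 0.2), ("A", 0.4), ("B", 0.1), ("C", 0.9), ("C", 0.5)]
normalized = decision_scores(rows)

for decision_id, value in sorted(normalized.items()):
    print(decision_id, round(value, 2))
```

Averaging before normalizing means that long decisions do not receive more weight than short ones, which is one reason to control separately for the length of the reasoning section.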
Table 1. Descriptive statistics of the data used.

The first question to be addressed is whether more balancing language is used in decisions overturning laws. A naive approach would be to compare the average use of balancing language between the group of decisions overturning laws and the group of decisions not overturning laws. On average, more balancing language is, indeed, used in decisions overturning laws than in decisions not overturning laws. This comparison of the averages between the two groups of decisions is in line with hypothesis H1. However, the explanatory power of such a naive comparison is very limited: There are likely confounding variables.
Therefore, this Article uses regression analysis to better understand the relationships between variables.Footnote 104 The regression model includes control variables in order to isolate the theoretically relevant relationship from potential confounders. Three control variables are used for the following analysis. The first is a dichotomous variable for the senate. The GFCC operates in two independent senates, each with different responsibilities. Accordingly, it is plausible that they behave in different ways.Footnote 105 Second, the length of the reasoning section is controlled for, as the decisions of the GFCC vary greatly in length. To this end, the number of tokens in the reasoning section was counted. As the distribution is highly skewed, the measure was log-transformed.Footnote 106 Third, a dichotomous control variable is introduced for the different types of proceedings. The GFCC has a number of different proceedings that lead to different types of decisions. The variable used here groups together, on the one hand, right-based review proceedings, namely constitutional complaints and abstract as well as specific judicial review of statutes. In these proceedings, violations of fundamental rights can be examined by the court. On the other hand, the second group includes all other types of proceedings, such as disputes between the highest federal organs, known as Organstreit proceedings; disputes between the Federation and the Länder; proceedings to ban political parties; or electoral complaints.
The analysis was carried out using linear ordinary least squares regression calculated with the statistical programming language R. As the time effect is not necessarily linear, squared and cubic terms for the year variable are included in the regression model. Table 2 shows the estimated model parameters of the preferred specification.Footnote 107
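The model described in the text can be written out as an equation. The following is a sketch based only on the variables named above; the symbols and variable labels are illustrative assumptions, and details such as the centering of the year variable or any additional terms in the preferred specification are not reproduced here.

```latex
\begin{equation}
y_i = \beta_0 + \beta_1 t_i + \beta_2 t_i^{2} + \beta_3 t_i^{3}
      + \beta_4\,\mathrm{overturn}_i + \beta_5\,\mathrm{senate}_i
      + \beta_6 \log(\mathrm{tokens}_i) + \beta_7\,\mathrm{rights}_i
      + \varepsilon_i
\end{equation}
```

Here $y_i$ is the normalized balancing language of decision $i$, $t_i$ the year of the decision, $\mathrm{overturn}_i$ a dummy for decisions overturning a law, $\mathrm{senate}_i$ a dummy for the second senate, $\mathrm{tokens}_i$ the length of the reasoning section, and $\mathrm{rights}_i$ a dummy for right-based review proceedings. Hypothesis H1 concerns the sign of $\beta_4$, while the time trend relevant to hypothesis H2 is described jointly by $\beta_1$, $\beta_2$, and $\beta_3$.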
Table 2. Coefficients of OLS linear regression model.

The first three coefficients estimate the time effect. The quadratic and cubic year terms suggest that there is no mere linear relationship between the use of balancing language and the year of decision. The estimated temporal dynamics are best understood in visualized form. Figure 5 shows the values of balancing language predicted by the regression model over time for decisions overturning laws as well as for the reference group. The use of balancing language increases until around the year 2000. From this point on, a decreasing use of balancing language is modeled. Thus, a temporal effect on balancing language was indeed found, but it only behaves as expected for a certain period of time. The development after the year 2000, however, contradicts the expectations of the hypothesis. In this sense, hypothesis H2 must be qualified.

Figure 5. Predicted values of balancing language over time for decisions overturning laws versus decisions not overturning laws—including ninety-five percent Confidence Intervals, adjusted for a right-based review proceeding of the first senate with median length.
Also surprising is the coefficient for the variable that indicates whether or not a law was overturned in the decision. Hypothesis H1 states that decisions overturning laws use more balancing language, which the naive empirical comparison of averages seems to confirm. However, when regression analysis is used to control for other variables, the positive effect disappears.Footnote 108 In fact, in the preferred model specification presented above there is even an effect in the opposite direction. Again, the reference category here is the class of decisions that do not overturn a law. The negative coefficient of the model, therefore, means that in decisions in which a law is overturned, less balancing language is used than in the reference class. This result, thus, contradicts the expectations of hypothesis H1.
The control variables also have plausible coefficients. They show that the longer the reasoning section, the more balancing language is used. According to the model, the two senates also differ in their use of balancing language. The first senate forms the reference category, so the negative effect indicates that the second senate uses less balancing language than the first. It can also be seen that significantly more balancing language is used in right-based review procedures.
II. Discussion
The empirical results thus shed mixed light on the hypotheses initially proposed. The expectations regarding the increase in the use of balancing language over time can only be partially confirmed. In its first years, the GFCC was more reticent and balancing in constitutional adjudication was not yet an established argumentation technique. There followed decades of increasing use of balancing language as the court became a stronger institution.Footnote 109 However, this trend appears to have changed after 2000, as the court seems to have become more reluctant. Such a trend reversal was not expected in the literature and is therefore an interesting empirical result. However, it is consistent with the findings of Petersen, who observed a decline in the 2010s.Footnote 110 Stohlmann and his colleagues also report an increase in the use of the proportionality test in the early decades, but conclude that it is by no means omnipresent.Footnote 111 Even in recent years, there is a large proportion of decisions in which proportionality is not invoked.Footnote 112
The model shows plausible coefficients for the control variables. That more balancing language tends to be used in longer reasoning sections is obvious, because balancing in constitutional rights adjudication requires a great deal of argumentative space. It is also apparent that more balancing language tends to be used in right-based review proceedings. The balancing debate is, after all, about balancing fundamental rights.Footnote 113 The model also shows an effect for the senates. Both senates work independently and the results seem to indicate that there are differences in the usage of balancing language, with the first senate using more balancing language than the second senate.
The hypothesis that was ultimately the starting point of the Article could not be confirmed. It was assumed that there would be more balancing in decisions overturning laws. However, when other variables were controlled for in the model, the expected positive effect disappeared. In the preferred specification the model even shows the opposite effect. Decisions that overturn laws tend to use less balancing language than other decisions. This insight was made possible by a new data set and new methodological approach. However, the findings are consistent with those of the other empirical studies. Petersen, who looked only at decisions overturning laws, concluded that balancing is less decisive in the GFCC jurisprudence than critics suggest.Footnote 114 Based on detailed argumentation analysis, Petersen states: “The preceding analysis has shown that, in most cases, the German Federal Constitutional Court does not use balancing to correct fundamental value decisions of the legislature.”Footnote 115
As a result, the findings of several empirical studies question the factual basis of the vehement criticism of balancing by the GFCC. Based on empirical insights, they arrive at a different description of the connection between this particular argumentation technique and judicial activism. Accordingly, a theoretical contextualization is necessary.
In order to contextualize the empirical findings with regard to balancing, first of all, it must be emphasized that the critics’ initial thesis is very broad. They see balancing as a method that is employed across the entire case law, that serves a strategic purpose, and that is increasingly being used. By contrast, a substantial body of German public law research argues that the GFCC uses balancing in a nuanced way. The literature emphasizes the sophisticated application requirements and mechanisms of proportionality and balancing.Footnote 116 This calls for a much more differentiated view of the court’s work. Therefore, the empirical findings presented in this study may not come as a surprise. However, this requires critics to come up with better and more specific explanations and hypotheses. The general suspicion cannot persist.
This raises the question of what can be inferred about the strategic use of balancing. It seems that the GFCC is more defensive in the majority of potentially confrontational decisions than assumed in the literature.Footnote 117 The recognition of the contentious nature of balancing is a good reason for the court to refrain from balancing. Since the first studies of judicial behavior, it has been argued that constitutional courts undermine their own position if they are perceived as activist actors.Footnote 118 Not using controversial techniques may, therefore, prove to be a sensible strategy.
However, it would be misleading to conclude from the empirical results that the era of balancing and proportionality is over. The so-called PSPP Decision of the GFCC on the Public Sector Asset Purchase Program of the European Central Bank, which falls within the last year covered by this study, is an extremely prominent and controversial counterexample.Footnote 119 In this decision, the GFCC declares a decision of the Court of Justice of the European Union (“CJEU”) to be ultra vires, largely on the grounds that, according to the GFCC, the CJEU did not adequately apply the proportionality test.Footnote 120 On the one hand, the decision proves the role that proportionality can play and the extent to which it is distinctly German legal heritage. On the other hand, it is also proof of the diverse ways in which this concept can be applied. The use of proportionality in this decision is an atypical exception.Footnote 121 The case concerns the non-application of proportionality by another court, not the justification of a judgment on the basis of balancing. Accordingly, the methodological approach chosen here is certainly not the most suitable for this special case, which provoked extensive debate.Footnote 122
Beyond balancing, the relevance of the present study for the relationship between legal argumentation and judicial activism is worth considering. The study focused on a very specific mechanism. It assumed that a particular argumentative technique, the use of balancing, was associated with a particular form of judicial activism, the overturning of laws. The presumed correlation could not be proven empirically. This is surprising, given the international academic attention.
Of course, this does not mean that there is no connection between the court’s legal argumentation method and judicial activism. There are two aspects to consider. First, judicial activism goes beyond simply overturning laws. The PSPP decision is a good example of this. It is obvious that this decision must be considered judicial activism; at the same time, this decision does not overturn any law. It is therefore worth considering the development of a different criterion of judicial activism for empirical research.
Second, there are other argumentation techniques and other forms of judicial discretion. Balancing is, of course, only one specific form of constitutional argumentation. There are many forms of judicial discretion used by the GFCC. One example is the interpretation of the eternity clause of Art. 79(3) of the Basic Law, which prohibits the amendment of certain parts of the constitution, although the GFCC understands it as a limitation to European integration.Footnote 123 There are also a number of rulings on distributive issues that derive a substantive minimum from human dignity, Art. 1(1) of the Basic Law, in conjunction with the principle of the social welfare state, Art. 20(1) of the Basic Law, which oblige the legislature to maintain certain standards.Footnote 124 Thus, there are several examples of what can be understood as judicial discretion—apart from balancing—that have provoked strong criticism.Footnote 125 However, in order to conduct empirical research into the toolbox of legal activism or judicial discretion, theories are needed that provide an explanation for the systematic use of certain argumentative tools.
Florian Meinel has recently made such a contribution, which is currently the subject of much discussion. Meinel argues that over the last 15 years the GFCC has undertaken a far-reaching reinterpretation of the German constitutional order.Footnote 126 Proportionality is seen as an important methodological tool.Footnote 127 However, the main focus is on the way in which the court interprets the law on the organization of the state. Meinel identifies a systematic technique used by the court—the application of administrative law standards to parliament and government. According to Meinel, “[t]he Court’s constitutional language has systematically levelled any difference between political institutions and administrative authorities.”Footnote 128 Meinel’s analysis is instructive in two ways. On the one hand, it fits with the findings presented here. Meinel’s diagnosis is consistent with the decline of balancing language in recent years. The GFCC focuses less on fundamental rights and more on the interpretation of competences in the law of state organization, where balancing is less common. On the other hand, Meinel formulates an alternative theory of the argumentative means used by the Court. Meinel sees the application of administrative law standards to parliamentary institutions as a systematic, strategic approach by the GFCC.
Overall, the relationship between judicial argumentation and judicial activism remains an important question for empirical research. The aim of this study was to provide quantitative insights into the use of a particular argumentative technique. Therefore, a major focus of the Article was to demonstrate that quantitative textual analysis methods can be reliably used for this task. This study is thus one of the first empirical contributions to provide insights into the strategic use of argumentation techniques for such a large number of cases.
At the same time, of course, the empirical evidence presented here comes with limitations. It is obvious that judicial reasoning is more complex and multifaceted than can be captured by measuring balancing language. In this sense, more effort is needed in the future to do empirical justice to the phenomenon of legal reasoning.Footnote 129 The analysis strategy used also has its limitations. This Article uses a simple linear regression model with only a few control variables. This Article itself demonstrates the importance of the choice of control variables. Accordingly, it is essential for further empirical analysis of balancing, but also for legal reasoning in general, to gain a deeper understanding of how activist decisions differ from others. This requires, in particular, conceptual work for better empirical measures. There are also a growing number of attempts to use specific research strategies and methods to carry out causal analyses.Footnote 130 Such analyses promise more rigorous results, but they are also much more demanding.Footnote 131 In particular, they require a causal theory of how courts decide and argue. The present study is more modest in this respect. It does not claim causality, but merely points out that decisions overturning laws do not simply use more balancing and the use of balancing has not simply continued to increase. In any case, this empirical work calls for new empirical and theoretical work on the Court’s strategic use of judicial argumentation techniques. In many ways, therefore, this Article demonstrates that the use of textual methods for the study of judicial decisions still has much potential.
F. Conclusion
This Article used word embeddings, a state-of-the-art approach to text analysis, to empirically investigate a specific argumentation technique, namely balancing, in the GFCC jurisprudence. This demonstrates the potential for judicial behavior research to use text-as-data methods to illuminate judicial argumentation. Contrary to theoretical expectations, the GFCC does not simply use more balancing language in cases where laws are overturned. It has also been shown that the use of balancing language only increased in the first fifty years of its existence until around 2000, as had been assumed. Since then, a decline has been observed. Thus, it appears that empirical research on the behavior of courts can fruitfully contribute to the theoretical debate on constitutional courts and their reasoning. For the debate on balancing in GFCC jurisprudence, it is no longer possible to make the sweeping claim that the court is more likely to engage in balancing when confronting the legislature.
Acknowledgments
In particular, I would like to thank Alexander Tischbirek for always being sure that there is such a thing as balancing language, and Christoph Möllers for always being sure that I could measure it. I would also like to thank the three annotators, and my colleagues Benjamin Engst, Anselm Hager, Christian Rauh, Bent Stohlmann, Nils Weinberg, Luisa Zimmer, and Lisa Zehnter for their helpful feedback.
Competing Interests
The author has no competing interests to declare that are relevant to the content of this Article.
Funding Statement
The work was funded by the DFG Leibniz Prize for Prof. Dr. Christoph Möllers, LL.M., which was awarded by the public German Research Foundation (Deutsche Forschungsgemeinschaft).