I. Introduction
With its release in November 2022, ChatGPTFootnote 1 has caused hype around Large Language Models (LLMs) and fuelled the discussion about the chances and risks of AI models.Footnote 2 Not only private end users and enterprises but also authorities are exploring general-purpose models such as ChatGPT, ClaudeFootnote 3 or Gemini.Footnote 4 They are experimenting with their own domain-specific LLMs, created by fine-tuning pre-trained models with domain-specific training data.Footnote 5 , Footnote 6 There are also initiatives to use such models to help individuals understand decisions concerning them and take legal action effectively.Footnote 7 While the models could increase the efficiency of administrative decision-making procedures and make legal aid available to vulnerable groups, concerns remain about LLMs’ functional limitations and their compliance with legal requirements, especially in the fields of data protectionFootnote 8 and administrative law. Existing sources on the practical and legal issues are limited insofar as technical studies on existing LLMsFootnote 9 do not generalise across different models and domains, and legal literature on automated decision-making largely refers to automated systems or machine learning models in generalFootnote 10 rather than to LLMs specifically.
This article provides an overview of the potential and challenges of LLMs in the context of administrative decision-making. As the analysis requires a sufficient understanding of the technical functioning of LLMs, section II provides a brief introduction to the functioning of existing LLMs. Section III outlines potential use cases for LLMs in the context of administrative decision-making from two perspectives: that of the authorities on the one hand and that of the concerned individuals on the other. Selected practical and legal challenges are discussed in section IV. Section V concludes and provides an outlook on future research.
II. Existing large language models (LLMs) in a nutshell
Understanding LLMs requires an idea of the underlying architecture and training (section II.1). It is also necessary to delineate non-generative from generative models (section II.2). Furthermore, the use of personal data for training the models and the storage of training data in the model parameters in such a way that the training data can be extracted (“memorisation”) must be taken into account (section II.3). The introduction is based on existing general-purpose models, ie, models that have the capability of serving a variety of purposes,Footnote 11 such as ChatGPT, Claude, Gemini or BERT, which only allow cautious conclusions about the potential properties of future models.
1. Architecture and training
LLMs are based on machine learning, the – at present – most relevant subfield of artificial intelligence (AI). The architecture of the models is referred to as artificial neural networks (ANNs), complex structures of mathematical instructions represented in a high number of nodes and connections between them.Footnote 12 An untrained ANN has unspecified (or random) parameters that serve as placeholders for information. A model with specified parameters is built through training, usually with large amounts of training data.
Training an LLM requires training data in the form of text. General-purpose models such as ChatGPT, Claude, Gemini or BERT, for example, are trained, among other sources, on large amounts of text data that are publicly available online, eg, books, research articles, Wikipedia articles and other websites.Footnote 13 For the training, HTML code and URLsFootnote 14 are usually eliminated from the texts,Footnote 15 and the cleaned texts are broken down into “tokens”, ie, sentences, parts of sentences, words, parts of words or even single characters. During the first phase of the training, the model is iteratively fed with incomplete sentences and predicts missing word(s), eg, the next one.Footnote 16 When this phase of training is completed, a model can generate outputs with correct syntax. To perform sufficiently with respect to semantics, the model is then trained on questions through reinforcement, based either on question-answer-pair databases,Footnote 17 or on human feedback.Footnote 18 However, generative LLMs that generate syntactically and semantically correct outputs do not understand grammar rules or the meaning of words.Footnote 19 Instead, existing models merely operate with statistical relationships between words. As machine learning models, in contrast to rule-based AI systems,Footnote 20 they derive patterns from the training data without explicit, interpretable rules. LLMs “learn”, based on the training data, which words in which order are likely in a specific context and thereby adjust their parameters such that they can generate highly probable word combinations.Footnote 21 Because LLMs are opaque machine-learning models, it is not possible to determine how a particular input led to a particular output, ie, to trace elements of their output to elements of the input with certainty.
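To illustrate the mechanism described above, the following minimal Python sketch shows next-word (token) prediction over a toy vocabulary; the conditional probabilities are invented for illustration and do not stem from any real model.

```python
# Toy illustration of next-token prediction: given a short context, a trained
# model assigns a probability to each candidate next token, and the text is
# extended token by token. Vocabulary and probabilities are invented.

next_token_probs = {
    ("the", "building"): {"permit": 0.62, "authority": 0.25, "site": 0.13},
    ("building", "permit"): {"is": 0.48, "was": 0.31, "application": 0.21},
}

def predict_next(context: tuple) -> str:
    """Greedy prediction: return the most probable next token for the context."""
    probs = next_token_probs[context]
    return max(probs, key=probs.get)

text = ["the", "building"]
for _ in range(2):
    text.append(predict_next((text[-2], text[-1])))

print(" ".join(text))  # "the building permit is"
```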
2. Generative vs. non-generative LLMs
Non-generative LLMs such as BERT are used to summarise, translate or classify text. Generative models such as ChatGPT, Claude, or Gemini can perform all these tasks. In addition, they are able to creatively generate novel texts. The goal of creating novel texts, ie, texts that to some extent differ from the statistical patterns in the training data, seems to contradict the models’ training goal of generating highly likely outputs, as these are per se similar to the training data. To create something new, generative LLMs use so-called temperature parametersFootnote 22 : Setting the temperature parameters low results in more deterministic outputs, ie, more likely word combinations that more closely follow the patterns and probabilities observed in the training data. Higher temperatures increase the variance of the generated outputs, leading to more diverse and creative responses that deviate further from the training data. This is because temperature parameters determine to what extent a generative model mixes statistically less likely words into its outputs. However, adjusting temperature parameters is tricky: Even a slightly too high temperature can cause a model to generate meaningless gibberish. Yet even “well-tempered” generative LLMs that were trained on correct data are not fully reliable as they often fabricate incorrect information (“hallucinations”).Footnote 23
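As a rough illustration of this mechanism, the following Python sketch (with invented token scores) shows how a temperature parameter rescales a model’s raw scores before sampling: low temperatures almost always select the most likely token, while higher temperatures give less likely tokens a realistic chance of being chosen.

```python
import math
import random

def sample_with_temperature(logits: dict, temperature: float) -> str:
    """Softmax sampling with temperature scaling (illustrative only)."""
    # Divide raw scores ("logits") by the temperature before the softmax:
    # T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_s = max(scaled.values())  # subtract the maximum for numerical stability
    exp = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(exp.values())
    probs = {tok: v / total for tok, v in exp.items()}
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Hypothetical raw scores for the next token after "The application is hereby ..."
logits = {"granted": 3.0, "denied": 1.5, "purple": -3.0}

print(sample_with_temperature(logits, temperature=0.2))  # almost always "granted"
print(sample_with_temperature(logits, temperature=1.5))  # noticeably more varied
```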
As this article explores the use of LLMs to generate complete administrative decisions, analyses or complaints (see section III), it focuses on generative models. For the following sections the term “LLMs” without additional specification refers to generative LLMs.
3. The use and “memorisation” of personal data in the training of LLMs
Training data for existing LLMs contain personal data, for example, the names of authors of research articles and books, and of any real persons referred to in books, research articles, Wikipedia articles and texts from other websites. It is usually impracticable, if not impossible, to completely remove personal data from the large amounts of training data. Also, the models are specifically trained with personal data such as names and addresses so that they “learn” how this information fits into texts.Footnote 24 Future breakthroughs in anonymisation techniques – especially those based on LLMsFootnote 25 – might allow for anonymising large amounts of text in an automated manner while preserving all relevant information. What results can be achieved with completely fictionalFootnote 26 or synthetic data is an open research question and depends on the intended use cases.
It is important to distinguish between training LLMs with personal data on the one hand and prompting them with personal data on the other. When prompted with personal data (for example, “Who is [name]?”), models usually generate outputs that include personal data (for example, “[name] is known for her expertise in the field of computer science”). However, most existing LLMs are not self-learning, ie, they do not use information from a prompt in other sessions. By contrast, during training, the models “memorise” information from the training data, ie, that information is reproducibly stored in the model parameters. For terms and texts that are prevalent in a large subset of the training data, this happens intentionally. This ensures, for example, that LLMs use common terms such as “artificial intelligence” rather than creating strange terms such as “artificial smartness”, and that they can provide information about historic events. “Memorisation”, however, also occurs unintentionally for pieces of text that are prevalent only in a small subset of the training data.Footnote 27 Although this phenomenon has not been extensively researched yet, experiments with existing models indicate that duplicates in the training data promote unintentional “memorisation”,Footnote 28 and that larger models with more parameters memorise more information from the training data than smaller ones.Footnote 29 In one experiment with GPT-2, the predecessor of the model that ChatGPT is based on, researchers were able to reproduce personal contact information from the training data. For ChatGPT, researchers were able to develop an attack to extract training data in a scalable manner.Footnote 30 It is likely that further breakthroughs in LLM research will involve more efficient methods to extract training data. When addressing “memorisation”, attention should be given not only to personal data but also to copyrighted material and trade secrets within the training data.
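A minimal way to probe a public model for such unintended “memorisation” is sketched below. It assumes the Hugging Face transformers library and the public GPT-2 checkpoint; the probed prefix and the expected continuation are hypothetical placeholders. The sketch only illustrates the basic idea of verbatim-completion probing, not the scalable extraction attacks cited above.

```python
# Minimal memorisation probe (illustrative): prompt a public model with a
# prefix suspected to occur in its training data and check whether the
# greedy continuation reproduces the known original text verbatim.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = "For further information, please contact"   # prefix to probe (placeholder)
known_continuation = " John Doe, phone"               # hypothetical string from the corpus

inputs = tokenizer(prefix, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,                       # greedy decoding keeps the probe deterministic
    pad_token_id=tokenizer.eos_token_id,
)
continuation = tokenizer.decode(output_ids[0], skip_special_tokens=True)[len(prefix):]

if continuation.startswith(known_continuation):
    print("Potential memorisation: the model reproduced the known text.")
else:
    print("No verbatim reproduction for this prefix.")
```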
III. Potential use cases for LLMs in the context of administrative decision-making
When discussing potential use cases for LLMsFootnote 31 in the context of administrative decision-making, two perspectives should be considered. On the one hand, LLMs might prove useful in the decision-making process (section III.1). On the other hand, LLMs could provide new tools for concerned individuals (section III.2).
1. LLMs in the decision-making
LLMs are able to perform various minor tasks that can enhance decision-making processes, eg, they have been found useful for processing and summarising large amounts of text, or for generating ideas or multiple potential solutions to a specific problem.Footnote 32 However, LLMs could also play a major role in decision-making, particularly as they allow for the development of decision-support systems that draft complete decisions. LLMs can create texts with high variance. In the context of administrative decisions, this makes them a promising tool for the automated generation of complex, individual explanations that are not merely based on text blocks.
Taking an application for a building permit as an example, a simple LLM-assisted decision-making process could look as follows:
(1) The applicant files an application for a building permit with the competent authority and hands in all required documents electronically and in a standardised format.

(2) The information from the application is fed into an LLM-based decision-support system.

(3) The LLM drafts a complete decision including an explanation, and the LLM-based decision-support system provides the competent decision-maker with this draft.

(4) The competent human decision-maker checks the decision and issues either a decision as proposed by the decision-support system, or a slightly altered decision, or a completely different decision.
In step 3, the decision-support system could either fully rely on an LLM, or combine an LLM with another component or other components.Footnote 33 In the latter alternative, the other component would determine the decision to issue or deny a building permit, and the LLM would be prompted to generate the decision text including the explanation for this given decision. In the former alternative, the LLM would freely generate the decision text and thereby determine the decision to issue or deny a building permit. Both alternatives require a domain-specific LLM that has been trained on specific training data, comprising not only relevant legal provisions, legal commentaries, documents on the legislative process and court rulings on these provisions, but also examples of decisions. As it has not yet been clarified whether, and if so, under which conditions, training on synthetic data could achieve satisfying results, and as the effective anonymisation of training data is not yet possible with reasonable effort,Footnote 34 the training would have to rely on historical real-world decisions. Where there are not enough real-world decisions available to train a performant model, LLM-based data augmentation technologies can be applied to expand and diversify the training data set by creating variations of available data in an automated manner.Footnote 35
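The two alternatives for step 3 could be sketched roughly as follows; the rule_based_check, the prompt wording and the llm_generate helper are hypothetical placeholders introduced for illustration, not parts of an existing system.

```python
# Hypothetical sketch of the two alternatives for step 3.

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to a (fine-tuned) domain-specific LLM."""
    return "Hypothetical decision draft: the permit is granted because ..."

def rule_based_check(application: dict) -> bool:
    """Simplified stand-in for a deterministic component applying the building
    code, eg, a minimum setback distance of 3 metres (invented threshold)."""
    return application["setback_distance_m"] >= 3.0

def draft_decision(application: dict, hybrid: bool = True) -> str:
    if hybrid:
        # Variant A: another component determines the outcome; the LLM only
        # drafts the decision text and its explanation for that outcome.
        outcome = "grant" if rule_based_check(application) else "deny"
        prompt = (
            f"Draft an administrative decision to {outcome} the building permit "
            "for the following application, including a full statement of reasons:\n"
            f"{application}"
        )
    else:
        # Variant B: the LLM freely generates the complete decision and
        # thereby also determines the outcome itself.
        prompt = (
            "Draft a complete administrative decision (grant or deny) with a full "
            "statement of reasons for the following building permit application:\n"
            f"{application}"
        )
    return llm_generate(prompt)  # the draft is then reviewed by the human decision-maker (step 4)
```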
In step 4, the decision draft of the LLM serves as the starting point for the human decision-maker’s considerations. The human decision-maker’s task consists of thoroughly assessing the suggested decision rather than developing their own decision from scratch or composing it from existing text blocks and, usually, additional text for the individual case. Unlike text blocks, the draft of the LLM already consists of a complete and continuous text that has been generated for the specific case. Compared to writing a decision from scratch or composing it based on text blocks, taking an LLM’s draft as a starting point of administrative decision-making has the potential to considerably reduce the time that human decision-makers need to extract decision-relevant information from complex case files and arrive at their decision.
LLMs could draft various other kinds of administrative decisions, eg, the decision of a data protection authority to impose an administrative fine on a data controller under the GDPRFootnote 36 , Footnote 37 the decision of an authority to ban certain slogans at a demonstration, or a financial supervisory authority’s decision to deny an applicant the authorisation to provide financial services. LLMs could especially increase the efficiency of administrative decision-making where it requires processing large amounts of text that take substantial effort to read, analyse or write, ie, where the decision-making is based on comprehensive file material or the decision requires extensive explanations. Furthermore, LLMs are capable of generating highly varied texts that cannot be composed from commonly used text modules, making them especially valuable in unique cases that demand a thorough individual evaluation.
2. LLMs as a tool for concerned individuals
Potential use cases for LLMs in the context of administrative decisions are not limited to the decision-making; the models could also prove useful for addressees of administrative decisions. LLMs could classify, translate, or summarise decisions, albeit these tasks require especially high precision in the legal context. LLMs could furthermore provide in-depth explanations of decisions and possible responses, legal analyses to identify legal errors, and even draft complaints. Such tools could either support the legal counsels of concerned individuals, or they could make legal remedies available to especially vulnerable groups that lack access to legal counselsFootnote 38 by generating complaints that concerned individuals can directly file with the competent authority or court. A hackathon on generative AI held in June 2023 aimed to explore, among other things, tools based on LLMs to translate legislation and other legal documents into simpler language,Footnote 39 or to generate human rights appeals for asylum seekers in an automated manner.Footnote 40
IV. Selected practical and legal challenges
Whether or not LLMs enable sufficiently good tools for both decision-makers and individuals who are concerned by decisions is an open research question. There are various concerns and doubts about both the practical feasibility of such tools and their legal compliance. The following sections introduce some of these challenges and, where applicable, highlight connections between them. Concretely, the article introduces practical and legal challenges regarding the suitability of LLMs for legal reasoning tasks (section IV.1), human oversight over LLMs (section IV.2) and the “memorisation” of training data (section IV.3).
1. Suitability of LLMs for legal reasoning tasks
Making, assessing or contesting an administrative decision involves solving legal reasoning tasks. Concretely, it is necessary to correctly apply legal norms to a specific case to draft, assess or contest an administrative decision. Legal reasoning can be highly complex as numerous legal norms can apply to a case and their application regularly involves interpreting what their often-ambiguous wording means.Footnote 41 Legal reasoning requires considering all relevant circumstances of a case. Administrative bodies are legally obliged to do so. On the EU level, the general principle of good (or “sound”) administrationFootnote 42 requires authorities to carefully examine all relevant facts of an individual caseFootnote 43 and investigate further where necessary. This requirement can also be derived from the right to good administration codified in Article 41 CFREU.Footnote 44 Corresponding requirements can be found in Member State law. For example, § 24 paras 1, 2 of the German Federal Administrative Procedure Act (Verwaltungsverfahrensgesetz) explicitly require authorities to take into account all circumstances relevant to the individual case, and to investigate the facts ex officio. The law regularly grants administrative bodies discretionFootnote 45 in decision-making, meaning there can be multiple lawful outcomes to a case. However, for legal reasoning tasks, it is essential to ensure consistency, which is also regarded as a general principle in EU administrative lawFootnote 46 : On the one hand, an administrative decision, analysis or complaint must be consistent in itself, ie, especially not self-contradictory; on the other hand, administrative decisions must be consistent with other decisions to ensure compliance with principles of equal treatment, as laid down in Article 20 CFREU or Article 3(1) of the German Basic Law (Grundgesetz). Ensuring equality requires that similar situations are not treated differently and different situations are not treated identically without an objective justification.Footnote 47 However, the administration can change decision-making practices in the light of new technological, legal or societal developments. For example, new studies on training data extractionFootnote 48 can demand changes to a data protection authority’s decision-making practice concerning the lawfulness of data processing with LLMs.Footnote 49
Legal reasoning is not limited to finding a (or the) lawful outcome (eg, grant/deny a building permit) but also includes providing its justification, ie, a clear explanation of the decision. Administrative bodies are obliged by law to provide reasons for their decisions. This requirement can be found in Member States’ administrative procedural laws such as § 39 of the German Federal Administrative Procedure Act,Footnote 50 which requires written or electronic administrative acts to be accompanied by a statement of grounds.Footnote 51 Furthermore, in EU administrative law, a duty to give reasons follows from Article 41(2)(c) CFREU, Article 296(2) TFEU, and as a general principle of law.Footnote 52 The main rationales of the obligation are the provision of information to the concerned individual on the one hand, and enabling courts and administrative bodies to review decisions on the other hand.Footnote 53 To enable concerned individuals to understand and challenge decisions concerning them and courts to exercise their power of review, administrative bodies must “disclose in a clear and unequivocal fashion [their] reasoning”,Footnote 54 stating in a concise and understandable manner all relevant facts, legal norms and decisive legal considerations.Footnote 55 The duty to give reasons not only ensures transparency but also facilitates a careful assessment of the case by decision-makers. The extent of the duty to give reasons, however, depends on the decision, and the requirements are particularly low for decisions that are in line with a consistent decision-making practice.Footnote 56
With respect to legal reasoning, the use of LLMs to draft, analyse or contest administrative decisions poses two main research questions:
(1) Feasibility of legal drafting with LLMs: Are LLMs fit to solve the legal reasoning tasks outlined in section III, ie, can they generate complete decision drafts, analyses or complaints that are consistent with the facts resulting from the case file and the applicable legal norms?

(2) LLMs and the duty to give reasons: Can administrative bodies that issue decisions drafted by LLMs fulfil their duty to give reasons and, if so, how?
In the context of research question (1), it is questionable whether and, if so, to what extent legal reasoning can rely on language in general (section IV.1.a) and on LLMs specifically (section IV.1.b). Furthermore, LLMs in the context of administrative decision-making must meet consistency requirements while also allowing for changes of administrative practices (section IV.1.c). For LLMs to prove useful for legal drafting in the context of administrative decisions, errors and biases must be avoided and robustness against manipulation must be ensured (section IV.1.d). Section IV.1.e addresses research question (2).
a. Language as a limited medium of communication
Some experts point out that LLMs, in general, have limited abilities as they are based on language, which is a limited medium of communication.Footnote 57 At first glance, this concern does not seem to apply to legal tech use cases as the law itself highly relies on language. However, legal language is particularly difficult as it is charged with ambiguous terms that are often used inconsistently in different legal acts and therefore require careful interpretation not only based on their mere wording but also in the light of the specific relationships between legal provisions, their historical context, and their intent and purpose. For example, determining whether a controller can claim legitimate interest according to Article 6(1)(f) GDPR requires a thorough risk assessment and balancing of interests. To give another example, the term “Eigentum”, which can be translated as “property”, refers in German civil law to the ownership of tangible objects only,Footnote 58 while the constitutionFootnote 59 uses the term in a broader sense, referring also to the ownership of rights. The fact that legal laypersons use legal terms without knowing their exact legal meaning adds to the problem. LLMs can only prove useful for drafting, analysing and contesting administrative decisions if it is possible to ingrain legal semantics in their parameters. Whether the models allow for that with the necessary precision, and, if so, what effort it takes to train and fine-tune such models, is an open research question that can be answered only through experiments and empirical studies with domain-specific LLMs, as results of attempts to train LLMs for legal use casesFootnote 60 and studies on existing general-purpose modelsFootnote 61 do not generalise across domains and to sufficiently fine-tuned models.
b. Limited reasoning abilities of LLMs
From a pessimistic point of view, it could be assumed that LLMs are simply unfit for making, analysing and contesting administrative decisions, as existing models have shown limited reasoning abilities.Footnote 62 Indeed, the few existing studies on solving legal problems with public general-purpose models and fine-tuned modelsFootnote 63 are largely phenomenological, limited to specific applications and models, and do not permit drawing general conclusions.Footnote 64 What makes legal reasoning through LLMs especially challenging is that LLMs rely on statistical relationships, ie, correlations, while the law relies on causalitiesFootnote 65 :
Taking a decision of a data protection authority to impose an administrative fine on a controller as an example, in terms of causality, the fine is imposed

– because the controller’s data processing is not compliant with the GDPR (Article 83 GDPR),

  – because they lack a legal basis (Article 6(1) GDPR),

    – because they obtained unlawful consent,

      – because the data subject has not unambiguously indicated their wish to consent (Article 4(11) GDPR),

        – because the controller obtained consent through a pre-ticked box (cf recital 31 GDPR),

    – and because no other legal basis applies,

  – and because the fine is effective, proportionate and dissuasive (Article 83(1), (2) GDPR).
By contrast, a – correlation-driven – LLM could, for example, generate a corresponding decision including the phrase “In accordance with Article 58(2)(i) and Article 83 GDPR, the [specific supervisory authority] hereby imposes an administrative fine” based on the prevalence of this phrase in decision samples within the training data that also contain the word combination “pre-ticked box” and other correlated elements. However, as existing LLMs are opaque machine learning models,Footnote 66 it is very difficult and often impossible to map elements of the generated output text to specific elements of the prompt.
Although LLMs do not follow the logic of the law and reach their results in a fundamentally different way than humans do, it seems premature to conclude that LLMs cannot assist in (legal) reasoning tasks per se.Footnote 67 An LLM can prove useful for the use cases outlined above if it merely mimics or simulates legal reasoningFootnote 68 well enough. For example, think of a hypothetical, highly performant LLM that drafts a lawful decision according to which a building permit is denied due to an insufficient setback distance. Even though such an LLM has no understanding of the meaning of words, its output could be based on relevant information in the prompt, namely the part of the text in the application for the building permit that contains the information allowing the distances of the building to neighbouring buildings, property lines, streets, etc to be calculated. However, a particular challenge lies in new legal norms for which there are no historical case data to train models on, or new judgments that require changes to decision-making practices.Footnote 69 Bringing together IT experts and experienced administrative professionals is imperative to build and use performant models.
Ongoing research explores various approaches to enhance LLMs’ performance in reasoning tasks. For existing models, researchers have proposed to solve legal problems with LLMs through prompt engineering, especially through breaking down prompts into series of reasoning steps.Footnote 70 GPT-4 can solve analytic problems by generating Python code that simulates the task.Footnote 71 In experiments with GPT-4 comprising tasks from a range of domains, the model has been found to solve a number of tasks that require reasoning.Footnote 72 Other researchers fine-tune LLMs on logical reasoning datasets,Footnote 73 or combine LLMs with symbolic solversFootnote 74 to enable them to perform complex reasoning tasks. It remains an open research question, however, to what extent the observations in studies with existing general-purpose LLMs generalise across domains and to sufficiently fine-tuned models.
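As an illustration of the prompt-engineering approach, the following sketch breaks a legal question down into explicit reasoning steps before asking for a conclusion; the prompt wording and the llm_generate helper are hypothetical, and whether such prompting yields reliable legal reasoning is precisely the open question discussed here.

```python
# Hypothetical sketch of step-wise prompting ("chain of thought"): the legal
# question is decomposed into explicit intermediate steps instead of asking
# for the result in a single instruction.

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to an LLM (assumed, not a real API)."""
    return "Hypothetical step-by-step analysis ..."

case_facts = "The controller obtained consent via a pre-ticked box ..."

stepwise_prompt = (
    "Answer the following question step by step.\n"
    f"Facts of the case: {case_facts}\n"
    "Step 1: Identify the legal basis the controller relies on (Article 6(1) GDPR).\n"
    "Step 2: Assess whether the requirements of that legal basis are met, "
    "citing the relevant provisions (eg, Article 4(11) GDPR and the relevant recitals).\n"
    "Step 3: State whether the processing is lawful and why.\n"
    "Step 4: Only then conclude whether an administrative fine under "
    "Article 83 GDPR can be imposed."
)

analysis = llm_generate(stepwise_prompt)
print(analysis)
```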
c. Consistency and changing decision-making practices
In terms of consistency, public general-purpose models have cast doubt on the usefulness of LLMs for tasks that require consistent outputs. Existing public general-purpose LLMs such as ChatGPT-4 or Gemini usually generate inconsistent outputs, ie, when prompted with the same text prompt multiple times, they will generate different outputs each time. This concerns not only the mere wording (eg, data processing “lacks a legal basis”/“is not lawful”Footnote 75 ), the variance of which may be irrelevant or even desirable for legal applications, but also the meaning of the output (eg, data processing is unlawful/based on consent). If an LLM generates two significantly different outputs for identical inputs, it is probably also more likely to produce divergent outputs for merely similar prompts, which could render an administrative decision unlawful by violating the principle of equal treatment.Footnote 76 For example, if an LLM generates two conflicting decisions – one granting a building permit and the other denying it – for the same case file, it is likely to also generate inconsistent decisions for two merely similar cases. LLMs are not output-inconsistent per se. After an update to the GPT API,Footnote 77 it is possible to generate more consistent outputs with GPT-4.Footnote 78 However, assessing LLMs’ usefulness in the context of administrative decision-making requires testing whether the models can meet the specific consistency requirements for legal use cases – namely, self-consistency in their outputs and, in administrative decision-making, alignment with the decision-making practices of the respective administrative body.
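How output consistency can be tested and, to some extent, enforced via a public API is sketched below; the snippet assumes the OpenAI Python client, uses a temperature of zero together with the seed parameter (a beta feature for reproducible sampling), and the model name and prompt are placeholders.

```python
# Sketch of a simple output-consistency test, assuming the OpenAI Python
# client (openai >= 1.0). Temperature 0 and a fixed seed make repeated
# completions as deterministic as the API allows.
from openai import OpenAI

client = OpenAI()  # expects an API key in the environment

prompt = "Summarise the facts of the following building permit application: ..."

outputs = []
for _ in range(3):
    response = client.chat.completions.create(
        model="gpt-4o",          # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,           # suppress sampling variance
        seed=42,                 # request reproducible sampling (beta feature)
    )
    outputs.append(response.choices[0].message.content)

# Minimal consistency check: identical prompts should yield identical texts.
print("consistent" if len(set(outputs)) == 1 else "inconsistent")
```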
Although consistency in administrative decision-making is desirable and necessary in general, administrations sometimes need and aim for change. For example, a building authority might want to set a new course for the future and change certain decision-making practices, eg, to grant building permits in cases where they would previously have been denied. One concern about the use of machine learning models in administrative decision-making is that the models are trained on historical data and might rather “predict the past” than imagine a future.Footnote 79 LLMs seem to solve this problem to some extent as they can generate novel outputs with high variance rather than being determined by the training data.Footnote 80 This might make decision-support systems based on LLMs less prone to perpetuating decision-making practices from the past than systems based on other models. However, administrative bodies typically plan changes to their decision-making practices with a specific vision for the future in mind. While an LLM could draft a vision for the future of an administrative body and propose concrete measures to work towards this future, existing LLMs have shown poor planning capabilities.Footnote 81 Research on LLMs’ abilities is, however, still at an early stage, and findings from experiments on existing LLMs, at best, provide little information on other, especially domain-specific, models. When authorities implement decision-support systems based on LLMs, it will be crucial to ensure that the decision-making is sufficiently consistent and that changes of decision-making practices do not happen randomly but are planned, regardless of whether the LLM is used for planning or not.
d. Errors, bias and manipulation
A main concern about the automation of making, analysing and contesting decisions based on LLMs is that the models may generate erroneous outputs such as an unlawful building permit, a flawed legal analysis of an administrative decision, or an unfounded complaint. Just like human decision-makers, any automated system, regardless of whether it is based on an AI model or a transparent algorithm, produces errors. Errors can render an administrative decision unlawful or a complaint against an administrative decision unsuccessful. While human decision-makers from the administration or legal counsels would likely be able to find and correct all errors, concerned individuals who fully rely on an erroneous LLM might misunderstand decisions concerning them or even file flawed complaints – with the risk of losing their legal remedies. From both a fundamental rights and an efficiency perspective, the use of decision-support systems that produce too many errors is not justifiable. Even decision-support systems that produce fewer errors than human decision-makers can violate principles of non-discriminationFootnote 82 if their errors disproportionally concern certain groups of individuals. There are many potential causes of erroneous decisions, for example, misinterpretations of legal terms,Footnote 83 poor reasoning abilities,Footnote 84 misinformation in the training data, training data information that is stored incorrectly or incompletely in the model parameters, or – for LLMs specifically – “hallucinations”Footnote 85 such as basing decisions on non-existing or non-applicable legal norms, court rulings,Footnote 86 or an incorrect summary of the case.
Furthermore, although LLMs can generate outputs that are less determined by the training data,Footnote 87 LLMs – just like other machine learning modelsFootnote 88 – have been shown to adopt biases from the training data.Footnote 89 In the context of administrative decision-making, for example, a model could be biased against Black applicants when trained with historic decisions in which building permits for applicants from a predominantly Black neighbourhood have been disproportionately denied. In experiments with existing LLMs, biased outputs could be avoided by telling the model not to be biased (in general or with more specific instructions).Footnote 90 Another study concluded that LLMs are capable of “moral self-correction” from a size of 22 billion model parameters and improve in that task with increasing model size and reinforcement training based on human feedback.Footnote 91 , Footnote 92 However, bigger models can cause other problems, for example, they can facilitate training data “memorisation”.Footnote 93
Also, malicious attacksFootnote 94 can cause incorrect outputs. For example, researchers were able to compromise existing LLMs’ outputs through malicious instructions in prompts (prompt injection).Footnote 95 Robustness against malicious attacks is especially important in administrative decision-making: Applicants for a building permit must not be able to trick the system into issuing an unlawful building permit by designing their application in a certain way.
When exploring LLMs in the context of administrative decision-making, mitigation techniquesFootnote 96 should be implemented and tested for specific use cases and user groups. Sufficiently checking LLMs for hidden bias and vulnerabilities might be possible through extensive prompting with variations of inputsFootnote 97 : A model could, for example, be fed with slightly varied applications for building permits that do or do not contain discriminatory factors or information that serves as a proxy for such factors,Footnote 98 ie, indirectly refers to them (eg, an address or a name that indicates the ethnic background of the applicant). By comparing the model’s outputs, it may then be possible to identify bias, for example if, for the same case, the model issues a building permit when the applicant has a German name but denies it when the name in the application is changed to an Arabic name. Such tests could also be employed to investigate cases where individuals challenge a decision affecting them as discriminatory.
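Such a counterfactual test could be sketched as follows; the application template, the name pair and the llm_generate helper are hypothetical and only illustrate the comparison logic described above.

```python
# Hypothetical counterfactual bias probe: the same application is submitted
# repeatedly, varying only the applicant's name, and the generated decisions
# are compared for divergent outcomes.

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to the decision-drafting LLM (assumed)."""
    return "Hypothetical decision: the building permit is granted because ..."

application_template = (
    "Applicant: {name}\n"
    "Plot: Musterstrasse 1\n"
    "Setback distance: 4.0 m\n"
    "Requested use: two-storey residential building\n"
)

names = ["Hans Mueller", "Amir Khalil"]  # names differing only in perceived background

decisions = {}
for name in names:
    prompt = (
        "Draft the administrative decision (grant or deny) for the following "
        "building permit application:\n" + application_template.format(name=name)
    )
    decisions[name] = llm_generate(prompt)

# Identical facts should lead to the same outcome; systematic divergence
# across many such name pairs would indicate bias.
outcomes = {name: ("grant" if "grant" in text.lower() else "deny")
            for name, text in decisions.items()}
print(outcomes)
```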
e. LLMs and the duty to give reasons
Even if an LLM drafts a lawful decision with reasons that justify the decision and are compliant with the applicable legal norms, it is questionable whether authorities that issue this decision fulfil their duty to give reasons (research question (2)). This is especially problematic insofar as existing LLMs are based merely on statistical relationships between words and can only mimic an understanding of their meaningFootnote 99 and of legal reasoning.Footnote 100 , Footnote 101 They draft their outputs word by word, predicting, based on the statistical distributions in the training data, which word follows another in a specific context – ie, the prompt and the text that the LLM has already generated for this prompt. Even a correct output, for example the lawful decision to deny a building permit, can be based on completely irrelevant elements from the prompt,Footnote 102 eg, the applicant’s address or name when the model has been trained on historic case files including rejection decisions addressed to persons from the same street or with the same name. The reasons stated in a decision that has been generated by an LLM always differ from the reasons for which the LLM has generated this decision.
It is questionable what the duty to give reasons requires for fully or partially automated administrative decisions.Footnote 103 For machine-learning systems, it has been argued that some degree of explainability is a prerequisite for fulfilling the duty to give reasons, as this duty requires explaining why a decision was reached and thus disclosing how different factors were weighed by the system.Footnote 104 Other legal scholars also deem explainability necessary for lawful AI systems,Footnote 105 even though there are trade-offs between model explainability on the one hand and model performance, usability, scalability and training data privacy on the other hand.Footnote 106
However, it is crucial to consider the specific kind of decision-support that LLMs could enable. An LLM does not merely output a binary recommendation (eg, grant/deny a building permit) or a risk scoring (eg, a security risk posed by a third-country national who wishes to enter the EUFootnote 107 ), but a complete decision text. If an LLM is used for administrative decision-drafting, two cases must be distinguished: Where the decision generated by an LLM is issued without meaningful human involvement, the reasons stated in the generated text cannot satisfy the duty to give reasons, given the intent and purpose of this obligation. This is because the obligation aims to ensure transparency of the decision-making and thereby enable individuals, courts and administrative bodies to understand the decision and assess its lawfulness.Footnote 108 LLMs may generate decisions whose reasoning appears transparent, but the reasons generated by the models do not reveal the factors relevant to the decision and their weight, as LLMs merely mimic reasoning. However, when a decision draft by an LLM serves as a starting point for human decision-making, the situation is different, even in cases where the human decision-maker issues the decision without any changes to the LLM’s draft. This is because a human decision-maker who checks an LLM-drafted decision adopts the reasons the LLM has generated to the extent they do not make changes to them (manually or through prompting the model again). In other words, human decision-makers can make LLM-generated reasoning their own. In this situation, neither the concerned individual nor courts or administrations need to understand how the LLM reached its output to assess the reasoning of the decision. The functioning of the LLM is then not the reason for the decision but simply the reason for the decision being made faster. To substantially explain their decision, the human decision-maker does not need to explain the functioning of the LLM.Footnote 109 LLMs, therefore, are especially promising for legal use cases. The fact that the outputs of LLMs are fully verifiable, regardless of the transparency of their parameters, supports the implementation of decision-support systems based solely on LLMs, ie, without combining them with other machine learning models whose lack of transparency could limit the verifiability of decisions. Whether or not there is a right to explanation under the GDPR that requires more transparency of the statistical weights of such models when they are used for automated decision-making has yet to be clarified.Footnote 110 It is questionable whether such information would facilitate the exercise of individual rights.Footnote 111
It is important to consider, however, that highly performant LLMs might compromise the purpose of the duty to give reasons, which is to enable a thorough assessment by the human decision-maker. This is because it is simply tempting for human decision-makers to issue an LLM-drafted decision without proper scrutiny rather than to take the time to carefully assess the case. Furthermore, LLMs that generate convincing reasons might reinforce automation bias, the human tendency to over-rely on computer-generated outputs.Footnote 112 In particular, there is the risk that users confuse an explanation provided by an insufficiently performant modelFootnote 113 with transparency of its functioning. In practice, it will be hard to distinguish cases in which human reasoning has taken place from cases where the decision was, in fact, made fully automatically by the LLM.
2. Human oversight over LLMs
As LLM-based decision-support systems bear a particular risk of human decision-makers issuing decision drafts without scrutiny,Footnote 114 implementing LLMs in administrative decision-making calls for safeguarding human oversight. Human oversight is required under two legal provisions: Firstly, data protection law grants individuals the right not to be subjected to solely automated decisions with legal effects on them.Footnote 115 A decision based on automated processing is not solely based on automated means only if a human decision-maker is meaningfully, and not merely formally, involved in the decision-making.Footnote 116 This requires human decision-makers to effectively oversee decision-support systems, ie, to scrutinise their outputs. Secondly, for high-risk AI systems, Article 14 of the AI ActFootnote 117 stipulates human oversight. Furthermore, as pointed out above,Footnote 118 human decision-makers must scrutinise drafts generated by LLMs in order to fulfil their duty to provide reasons when issuing a decision with LLM-generated reasons.
When implementing LLMs in the context of administrative decisions, it is crucial to avoid automation bias by design, ie, to design the systems in a way that facilitates critical assessment.Footnote 119 Legal scholars assume that explainabilityFootnote 120 facilitates control and oversight over AI systems.Footnote 121 However, this view lacks differentiation. Explainable (or interpretable) AI can either help human decision-makers to identify erroneous outputs, or reinforce automation bias as an explanation can make the output seem even more reliable.Footnote 122 Other approaches to facilitate human oversight should be considered. For example, it is possible to design a decision-support system in such a way that it sometimes incorporates errors in its outputs, and informs human decision-makers of this in order to undermine their trust in the system.Footnote 123 Any LLM-based decision-support system should be deployed only after thorough testing and empirical validation to ensure human oversight. This requires user studies with the specific systems, tailored to the specific use case (eg, decisions to grant/deny building permits) and involving members of the particular group of administrative decision-makers who are intended to use the system.Footnote 124 This is because personal and contextual factors as well as tool characteristics can affect automation bias.Footnote 125 Furthermore, to enable human decision-makers to effectively oversee LLM-based decision-support systems, it is important that they not only have an understanding of the functioning and limitations of LLMs in generalFootnote 126 (also cf Article 13(3)(b) AI Act) but also experience with the specific system.Footnote 127
Automation bias is not only a problem in decision-making; it can also impair the benefits of tools that are designed to help those who are concerned by decisions. When legal counsels or the concerned individuals themselves over-rely on the outputs of an LLM, this might result in individuals not effectively exercising their remedies or even losing them. Consequently, tools should only be released after representative user studies and be subject to regular audits, as both decision contexts and users might change over time. This task is especially challenging for public tools that are intended to be used by legal laypersons from various educational and language backgrounds.
3. “Memorisation” of training data
Some of the biggest legal problems of existing LLMs arise from the “memorisation”Footnote 128 of training data. The fact that information from the training data is reproducibly stored in the model parameters during training poses risks, especially where this information comprises personal data. The training of an LLM in the context of administrative decision-making requires various data that also comprise personal data: Legislative materials contain the names of individuals involved in the legislative process. Court rulings include names of judges and persons involved in the case. Historic case files refer to applicants, addressees of decisions and other concerned individuals. Research articles and commentaries comprise information on authors. To the extent personal training data are consistently stored within a model’s parameters in such a way that personal information can be extracted by prompting the model, both the LLM and its outputs, when such data are reproduced, constitute personal data.Footnote 129 The probability that memorised personal data are reproduced is arguably higher for publicly available LLMs. This is because a bigger group of potential users, including not only individuals that are subject to administrative decisions but also other interested groups such as researchersFootnote 130 and malicious attackers, might prompt the model to extract training data from it, or even accidentally cause the model to output training data. Where LLMs include memorised personal training data in decisions or complaints, users are at risk of unlawfully processing personal data, especially but not only where the data are inaccurateFootnote 131 due to “hallucinations”. For example, an LLM could feed information on an addressee of a historic decision from the training data into a decision on a building permit.
Similarly, when “memorisation” concerns copyrighted material,Footnote 132 users might unintentionally infringe copyrights. For example, an LLM might include a section from a legal commentary in the explanation of a decision or a complaint without referencing the source. Furthermore, “memorisation” can concern business or trade secrets – with the risk that these secrets are disclosed to users of the system and shared with others. For example, an LLM used for decision-drafting in a building authority might feed information on a manufacturing process of a company into a decision concerning another company, and a human decision-maker might not remove this information from the draft before issuing the decision.
To develop decision-support systems and tools for concerned individuals that are compliant with data protection and copyright requirements, investigating “memorisation” is crucial. It is an open research question to what extent findings on existing models apply to other, especially domain-specific, models. Avoiding the use of personal dataFootnote 133 or copyrighted material in training entirely seems impractical and would impair a model’s ability to include correct references in its outputs. Also, “memorisation” can be desirable in certain use cases for some texts from the training data.Footnote 134 To what extent “memorisation” or the reproduction of training data in the output should be avoided must be determined for the specific use case, model, user group, and training data, considering the risks of training data reproduction. In any case, it seems necessary to efficiently avoid the reproduction of personal training data and unreferenced copies of training texts. When addressing other challenges of LLMs in the context of administrative decision-making, it is crucial to consider potential trade-offs with “memorisation” mitigation. For example, a model with more parameters might achieve better results in legal reasoning tasks but be more likely to reproducibly store training data information within its model parameters.
V. Conclusion and outlook
Existing models’ properties and abilities make LLMs a promising tool to increase the efficiency of administrative decision-making processes, and to help individuals to understand, analyse, and contest decisions concerning them. For human decision-makers in administrations, LLM-generated decision drafts could serve as a starting point for decision-making. However, the models pose various challenges that may impair both their performance and their compliance, especially with requirements under administrative law, fundamental rights, data protection law, the AI Act and copyright law. A main challenge for LLMs in the context of administrative decision-making lies in training models that are sufficiently performant in legal reasoning tasks and thus able to generate complete legal assessments that are consistent with the specific case and applicable legal norms. There are numerous approaches to improve LLMs’ performance in legal drafting. It is especially questionable whether LLMs need to be combined with rule-based systems that follow a formal logic, or whether LLMs can be trained to mimic legal reasoning with sufficient results based on mere correlations. It will be crucial to avoid errors, biased outputs and manipulation attacks. Ensuring model performance and accuracy is especially critical for LLMs that are intended to be used by legal laypersons such as concerned individuals who cannot afford a lawyer. When assessing LLMs’ usefulness for administrative decision-making, their compatibility with the need for consistent administrative decisions on the one hand, and the need for planning changes in decision-making practices on the other hand must be considered. Unlike with other opaque machine learning models, the outputs of LLMs can be fully examined for legal compliance. The article argues that the administration can, therefore, fulfil the duty to provide reasons even with LLM-generated reasons if a human thoroughly checks the LLM-generated draft. However, to facilitate a careful assessment of the case and to meet human oversight requirements, it is necessary to design administrative decision-support systems and tools for concerned individuals and their counsels in a way that facilitates human oversight. Which design elements and functionalities achieve this goal is an open research question and depends on the concrete use case and user group.
Whether and, if so, to what extent and under which circumstances LLMs can facilitate making, analysing and contesting decisions, and how human oversight can be ensured, can only be assessed through fine-tuning domain-specific LLMs and testing them for specific use cases involving members of the specific user group. This is because observations from previous, inherently experimental research on existing models hardly provide any insights into the performance and lawfulness of new, domain-specific LLMs in the context of administrative decision-making. The development of performant LLMs with the necessary safeguards requires a holistic view of requirements and challenges, as it is important to identify trade-offs, for example between bias prevention and “memorisation” prevention,Footnote 135 performance and “memorisation” prevention,Footnote 136 or explainability and performance.Footnote 137 This requires bringing together at least legal experts and practitioners to define legal and practical requirements, computer scientists to prepare training datasets, fine-tune domain-specific models, implement safeguards and run tests, and social scientists to carry out empirical user studies examining, in particular, user attitudes towards LLMs in the context of automation bias. Based on such studies, for specific use cases, it will be possible to determine the costs and benefits of the models’ development and use, and to assess whether, and if so, to what extent, LLMs can increase the efficiency of making, analysing and contesting administrative decisions. As it is possible to extract training data, including personal data, from existing pre-trained LLMs, researchers must ensure compliance of their research with data protection law.
The generation of complete legal texts with legal reasoning is relevant not only in the context of administrative decision-making but also for diverse use cases in the legal domain. The prospect of LLMs becoming sufficiently advanced to draft laws, administrative decisions, court rulings,Footnote 138 legal submissions, and scholarly essays raises the need for a societal debate on automation. This dialogue should also consider dystopian scenarios where communication in legislative procedures, administrations, courtrooms, and academic legal circles is dominated by LLMs interacting with each other, with minimal human input.
Acknowledgments
The author sincerely thanks Prof. Rainer Böhme (University of Innsbruck) for helpful comments, and Shalu Mohan (FAU Erlangen-Nürnberg) for her invaluable help with the manuscript and her constructive feedback. Also, the author expresses her gratitude to the INDIGO project team and project-external participants of the final project conference for thought-provoking discussions.
Competing interests
The author has no competing interests to declare.