“The cat is out of the bag. Universal Basic Income is 760 euro each month. Unemployment benefits, sick pay and other benefits will be abolished. To be used for self-realisation. Crazy. #Tegenlicht”
This is an example of how a Dutch socialist politician argued against Universal Basic Income. As many scholars nowadays recognise, such arguments and ideas within policy discourses are crucial to understanding policy reform processes (e.g. Béland, 2019; Prior, Hughes and Peckham, 2012; Schmidt, 2008). Political coalitions are constructed and communicated through arguments – or positions – expressed in the media. Similarly, policymakers and other stakeholders may rally around ‘good ideas’ that draw attention in media debates (e.g. Willems and Beyers, 2023). Moreover, media coverage pressures political actors to (not) implement policy reforms (e.g. Jensen and Wenzelburger, 2021). Whether seen as the glue that binds coalitions or as the driver of policy change, policy discourse has become a major concern for those interested in policy processes.
The value of policy discourse analysis is demonstrated by its broad applicability, for example, in analyses of environmental policy (Gutierrez Garzon et al., 2022), welfare policies (Blum and Kuhlmann, 2019; Theiss, 2023), urban mobility (Towns and Henstra, 2018) or educational policy (Symeonidis, Francesconi and Agostini, 2021). Moreover, those using the discourse network analysis framework also aim to identify coalitions on the basis of arguments employed in policy debates (e.g. Eder, 2023; Fergie et al., 2019; Gupta et al., 2022; Leifeld, 2013; Markard, Rinscheid and Widdel, 2021). Policy discourse analysis routinely relies on content analysis. In the context of policy discourse, content analysis is used to annotate several elements, such as actors, arguments or policy solutions, and actors’ (dis)agreement with arguments. Here, we focus on the task of identifying policy solutions or arguments (i.e. ‘thematic analysis’; see Nguyen-Trung, 2024) as the most important and most difficult-to-code element. The difficulty of identifying arguments lies not only in the first stage, wherein a coding scheme is defined through the interpretation and comparison of texts, but also in the time and effort spent applying the crystallised coding scheme to large amounts of documents. This specific subtask of content analysis is sometimes referred to as a ‘directed’ or ‘supervised’ stage of content analysis (e.g. Hsieh and Shannon, 2005; Petchler and González-Bailon, 2015). While the automated classification of arguments would vastly increase the speed and scale of analysing policy discourse, attempts to automate this process have thus far not produced very accurate results (e.g. Ceron et al., 2024; Haunss et al., 2020; Lapesa et al., 2020).
A solution to this problem might lie in the much-discussed language model ChatGPT.Footnote 1 Much like bag-of-words classification (Kowsari et al., 2019; see also Grimmer and Stewart, 2013) and bidirectional encoder representations from transformers (BERT) models, ChatGPT can be instructed to classify texts in terms of containing a particular topic. As a tool for automated content analysis, ChatGPT might have enormous potential. First, large language models (LLMs) such as ChatGPT can dramatically reduce the time spent on manual content analysis. Second, as with other automated approaches, ChatGPT can handle large amounts of data, allowing researchers to extend their scope across longer periods of time or multiple contexts; manual annotation can then be limited to a sample of the corpus used to validate the results. Third, unlike typical natural language processing techniques, using ChatGPT’s user interface does not require special programming skills or statistical knowledge. Any researcher with one or more well-defined topics could extend their content analysis to a corpus of virtually unlimited size. As such, it opens a world of opportunities for social scientists working in the qualitative or survey traditions.
In this paper, we thus seek to validate the classification abilities of ChatGPT in the context of policy debates in Germany (pension reforms, see Leifeld, 2016) and the Netherlands (Universal Basic Income, see Gielens, Roosma and Achterberg, 2022). Comparing these cases represents a strong empirical test as it grants insight into the generality of ChatGPT’s abilities, namely in handling policy debates in different languages and of different lengths, contexts and complexity. We first assess the reliability of the model by repeating classifications over several iterations. We then compare the classifications provided by ChatGPT with the human annotations of these datasets. In the following sections, we first present the existing literature on content analysis relying on LLMs such as ChatGPT. Next, we describe the datasets and our methodological approach. We then present the results before moving to a discussion of the implications and limitations of our analysis.
ChatGPT for text analysis
Political scientists have been fascinated with the potential for automated text analysis for some time (e.g. Grimmer and Stewart, 2013; see also Slapin and Proksch, 2008). More recently, the field has turned to large language models to efficiently identify, for example, ideological placement (Rheault and Cochrane, 2020), political emotions (Widmann and Wich, 2023) and political manifestos (Laurer et al., 2024; Licht, 2023). Miller, Linder and Mebane (2020) explored an active labelling strategy where manual classification is aided by a text algorithm to select relevant documents (see also Alshami et al., 2023).
Researchers have now begun exploring the text-analytic abilities of ChatGPT. For example, Prakash et al. (2023) have used the model to identify topics in a collection of memes by clustering texts or images on the basis of similarities in images and words. In their application, ChatGPT ‘outperformed well-established topic models across three distinct datasets’ (p. 8). A rapidly expanding number of studies have also used ChatGPT for content analysis, much like in the current application. Huang, Kwak and An (2023) have used the model to classify tweets containing hate speech, finding that ‘ChatGPT correctly identified 80% of the implicit hateful tweets in our experimental setting’ (p. 4). Wang et al. (2021) find that the chatbot classifies news topics with an accuracy ranging between 77.5 per cent and 87.5 per cent depending on the labelling strategy. Gilardi, Alizadeh and Kubli (2023) analysed topics and frames in tweets and news articles, showing that GPT-3 classifications matched trained annotators in around 60 per cent of cases. Moreover, they find that the chatbot substantially outperforms crowd-workers recruited on MTurk (see also Horn, 2019). Morgan (2023) and Turobov et al. (2024) investigate how GPT performs relative to manual annotation in thematic analysis and topic classification of focus groups and United Nations policy documents, respectively, but they do not provide measures of classification performance. In a wide array of classification tasks, Ziems et al. (2023) show that GPT-4 performance differs strongly between types of utterances. The model does well at identifying stance and ideology but performs poorly in classifying, for example, misinformation and implicit hate speech. These applications demonstrate the potential and limitations of ChatGPT in clustering and classifying policy debates.
With this contribution, we add to this emerging field of study in three ways. First, we extend the validation of this method to the application of policy debates. While some attention has indeed gone to our sources of interest – newspaper articles and tweets – prior applications have involved specific datasets on hate speech (Huang, Kwak and An, 2023), content moderation (Gilardi, Alizadeh and Kubli, 2023) and general news topics (Wang et al., 2021) that do not necessarily translate to the classification of arguments in policy debates. In addition to being valuable for students of policy discourse, validating this classification task contributes to demonstrating the general applicability of ChatGPT in classification problems.
Second, we add to prior research by introducing more fine-grained evaluation methods. Existing studies have relied heavily on accuracy and inter-rater reliability metrics (Gilardi, Alizadeh and Kubli, 2023; Huang, Kwak and An, 2023; Wang et al., 2021; cf. Ziems et al., 2023). When datasets are unbalanced – that is, there are more non-occurrences (0) than occurrences (1) of arguments – accuracy is driven largely by the correct classification of non-occurrences and can paint an overly favourable picture (e.g. Juba and Le, 2019). We therefore also report precision and recall, which capture the rates of false positives and false negatives in the classification of argument occurrences. A more detailed explanation of these metrics is included in the methods section.
Lastly, we generate relevant insights by performing our empirical test on different models (GPT-3.5 Turbo and GPT-4 Turbo) and through different types of access (user interface [UI] and the application programming interface [API]). This allows for speculation about future improvements in the content analysis abilities of ChatGPT and gives practical hints for researchers interested in relying on ChatGPT in future research. We expect significant improvements with the newer version but similar performance between different access types (API versus UI). Although we do not expect major differences in performance between API and UI, the comparison is valuable for potential end-users for reasons of accessibility (API is paid and requires programming skills) and time (API is considerably faster).
Datasets
We used two distinct datasets, each representing a unique case study with its own set of characteristics. The first dataset is a collection of tweets on a Universal Basic Income (UBI) in the Netherlands posted between 2014 and 2016. The second dataset is composed of newspaper articles on the German pension reforms published between 1993 and 2001. We specifically relied on datasets analysed in previous peer-reviewed studies (see Gielens, Roosma and Achterberg, 2022; Leifeld, 2016) to ensure the quality of the human coding, which is used as a comparative benchmark for ChatGPT’s coding performance. For each of these datasets, we assess performance for the ten most frequently adopted arguments.
The arguments and their descriptions were initially created by the human coders in the original publications. These studies inductively developed and refined coding schemes, which were reviewed by a second researcher to test their reliability. The tweets dataset adopted a formal inter-coder reliability procedure, yielding an average Cohen’s kappa of κ = 0.430 across arguments. This can be considered ‘fair to good’ (Fleiss, 1981, p. 218), especially considering the high number of arguments and the lack of context present in tweets. The newspaper dataset assessed reliability by having a second researcher evaluate all labels, with discussion and revision in cases of disagreement.
We used the descriptions of arguments provided in the codebooks of these studies to test how successfully the manual coding can be replicated using ChatGPT. The argument descriptions were taken in their original language from the source data of the content analyses. Because the descriptions were sometimes very technical, we shortened and simplified them where possible. These descriptions are used as input for the classification task. They are listed in their original language, as used in the classification task, in Appendix A; English translations are presented in Tables 1 and 2.
Dataset 1: Dutch Twitter data on universal basic income
Dataset 1 consists of tweets by 5,128 Dutch Twitter users discussing the benefits and disadvantages of the Universal Basic Income (UBI) policy proposal. Gielens, Roosma and Achterberg (2022) manually analysed the content of these tweets on 3 days of peak attention between 2014 and 2016. They identified fifty-six arguments in favour of or against UBI within these tweets. A detailed description of the dataset and coding process can be found in Gielens, Roosma and Achterberg (2022). Twitter data are an interesting case study for our analysis due to the messy nature of tweets. Tweets are short messages with a platform-specific writing style that are difficult to understand in isolation. Machines often find it hard to clearly identify topics in this type of data (Duarte, Llanso and Loup, 2017), so good performance by ChatGPT would be encouraging. The codebook for this dataset contains arguments related to a specific policy. The ten most frequently occurring topics are included in our analysis (see Table 1).
Dataset 2: newspaper articles on the German pension policy reform
Dataset 2 includes statements collected from German newspapers in the period from January 1993 to May 2001, preceding the German Riester pension reform. The dataset includes 7,249 statements by political actors about sixty-eight concepts, drawn from 1,879 articles published in this period and identified by human coders. A detailed description of the dataset can be found in Leifeld (2016). The second dataset serves as a stark contrast to the first: newspaper articles are written by professionals trained to produce clear texts, are not restricted by very short character limits, and can thus present a more coherent picture than is possible in a tweet. Accordingly, the structured and expertly crafted nature of the statements provides a reliable benchmark against the ‘messy’ tweets in dataset 1, making it an interesting point of comparison for our analysis. The codebook for this dataset contains proposed policy solutions related to the financing gap in the pension system, rather than arguments for or against one specific policy. The two datasets hence cover a broad spectrum of codes found in policy discourse analysis. The ten most frequently occurring topics are included in our analysis (see Table 2).
Sampling
To save time and money, we selected a stratified random sample of tweets and newspaper articles. We used a stratified random sample rather than a simple random sample to ensure that the categories we used for coding were well represented in the analytical sample. For each of the ten selected arguments, we sampled a fixed fraction of the documents containing that argument (50 per cent of tweets and 25 per cent of newspaper articles), removing duplicates. Documents in each dataset are randomly shuffled to minimise bias.
We sampled 50 per cent of the tweets containing each argument included in our analysis. For example, the deregulation argument was identified in 244 tweets, so the sample contains 122 tweets mentioning the deregulation argument. Our sample for dataset 1 contains 1,282 tweets. We sampled 25 per cent of the newspaper articles containing each argument included in our analysis. This fraction is lower because newspaper articles are longer and therefore more expensive and time-consuming to process. Our sample for dataset 2 contains 537 newspaper articles.
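To make the sampling step concrete, the following minimal sketch (our illustration, not the authors’ published script; the data layout, column names and random seed are assumptions) draws the per-argument stratified sample, removes duplicates and shuffles the result:

```python
# Illustrative stratified sampling; 'doc_id' and the 0/1 argument columns are assumed.
import pandas as pd

def stratified_sample(df: pd.DataFrame, arguments: list[str],
                      fraction: float, seed: int = 42) -> pd.DataFrame:
    """df has one row per document and one 0/1 column per argument."""
    parts = []
    for arg in arguments:
        containing = df[df[arg] == 1]                       # documents with this argument
        parts.append(containing.sample(frac=fraction, random_state=seed))
    sample = pd.concat(parts).drop_duplicates(subset="doc_id")  # remove duplicates
    return sample.sample(frac=1, random_state=seed)         # shuffle document order

# e.g. tweets_sample = stratified_sample(tweets, top10_arguments, fraction=0.5)
#      articles_sample = stratified_sample(articles, top10_arguments, fraction=0.25)
```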
Methods
Our methodological approach consists of three steps. First, we designed the instruction (prompt) for ChatGPT for the API version (GPT-4 Turbo) and the user interface (GPT-4). Rather than writing a different prompt for each debate, we developed prompts that can be used for any policy debate. Second, we assess the reliability of the model by repeating classifications; third, we assess its validity against human annotations. Since GPT-4 Turbo (API) is more capable and less error-prone than GPT-4 (OpenAI, 2024), we used GPT-4 Turbo as the reference model for our analyses and relied on the standard version of GPT-4 (user interface) as a complementary model.
Prompt engineering
ChatGPT’s performance largely depends on the prompt and context provided. We formulated an extensive prompt for directed qualitative content analysis, developed through a trial-and-error process. A rapidly expanding literature on prompt engineering has emerged to test which instructions are most effective across a variety of tasks. Broadly speaking, there are four core elements of prompt engineering: providing context, asking a question, setting model parameters, and providing output constraints. Specifying the context of a question is important to guide performance. Ekin (2023) notes that specifying a knowledge domain (e.g. Universal Basic Income) and a role for the chatbot (e.g. a social policy expert) improves performance. Clavié et al. (2023) find that adding a (any) name to the role of the bot also improves performance. Asking a question is best done in single sentences, breaking up complicated instructions into multiple prompts (Wu and Hu, 2023). In such multi-turn dialogue, Clavié et al. (2023) further point out that better results are obtained when asking whether the chatbot understood the instruction and providing positive feedback in between prompts. Moreover, they find that prompts that compare options A and B elicit deeper reasoning. Regarding parameters, Wu and Hu (2023) find that setting a lower temperature increases the focus and reduces the randomness of replies (see ‘model parameters’ section below). OpenAI provides similar advice, advocating strategies and tactics such as writing clear instructions, splitting complex tasks into simpler sub-tasks (‘prompt chaining’) or providing reference texts and examples.Footnote 2
On the basis of these suggestions and several trial-and-error adjustments, we used the prompt below. First, we set the following system instructions:

Persona: You are a professional researcher named Jakub. You are an expert on qualitative content analysis. You are always focussed and rigorous.

Task Description: Analyse [language] [document_type] for arguments related to [policy_name]. [policy_description]. The analysis will identify whether [document_type] contain arguments for or against [policy_name].

Then we entered the following messages in a prompt chain:

Determine whether a [document_type] discusses each of the following ten arguments:

[[arguments]]

[document_type] contain an argument if the author opposes the argument and also when the author argues in favour of the argument.

[policy_name] need not be mentioned explicitly in the [document_type] to relate to the argument. A [document_type] can discuss more than one argument.

You will now be provided with 5 [document_type] separated by a new line.

[[documents]]

Finally, we spent a substantial part of the prompt specifying the output format:

For each [document_type], provide a classification for each argument in an HTML table.

Do not include the text of the [document_type] in the table. Only report the classification values.

The HTML table has 5 rows, one per [document_type].

The HTML table has 10 columns, one per argument.

The elements of the table are ‘0’ and ‘1’. Indicate ‘1’ if the [document_type] discusses aspects of the specified argument and ‘0’ if the [document_type] does not discuss the specific argument.

Here is an example of the required output format:

[example_output]
By entering values for the [variables], this prompt can be adapted to any policy debate and any number of arguments. The fully written prompt versions used for the two datasets can be found in Appendix B. The full script and other documentation (e.g. full prompt for user interface) will be available via the Open Science Framework.
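As an illustration of how the bracketed [variables] can be substituted programmatically, the following minimal sketch (our own, not the authors’ script; the function name and example values are hypothetical) fills a prompt template for one batch of documents:

```python
# Illustrative template filling; [[documents]] is filled afresh for each batch of five.
def fill_template(template: str, document_type: str, policy_name: str,
                  arguments: list[str], documents: list[str]) -> str:
    prompt = template.replace("[document_type]", document_type)
    prompt = prompt.replace("[policy_name]", policy_name)
    prompt = prompt.replace("[[arguments]]", "\n".join(f"- {a}" for a in arguments))
    return prompt.replace("[[documents]]", "\n".join(documents))

# e.g. fill_template(user_prompt, "tweets", "Universal Basic Income",
#                    ten_argument_descriptions, batch_of_five_tweets)
```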
Reliability
ChatGPT is a generative model – also known as probabilistic or non-deterministic – which means that answers can differ if asked in different chats or when asking the same question twice in the same chat. Therefore, before comparing the model classification with human annotations, we tested the reliability of the model by simply re-iterating the procedure k = 5 times and correlating the resulting vectors of zeros and ones. To evaluate the reliability of these replications, we calculate the phi correlation, a binary measure of association, for each unique combination of replications. The phi correlation is symmetrical, so we have [k*(k−1)]/2 = 10 unique chat combinations. The reliability of the chatbot can be inferred from the mean and variance of these correlations.
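For illustration, a minimal sketch of this reliability check (our own code, not the authors’; the `runs` object is an assumed data structure holding the 0/1 classifications of each repetition) computes the phi coefficient as the Pearson correlation between two binary vectors and averages it over all pairs of runs:

```python
# Pairwise phi correlations between k repeated classification runs.
from itertools import combinations
import numpy as np

def phi(a, b):
    """Phi coefficient between two binary (0/1) vectors (equals Pearson r)."""
    return np.corrcoef(np.asarray(a), np.asarray(b))[0, 1]

def pairwise_reliability(runs):
    """runs: list of k equal-length 0/1 vectors (documents x arguments, flattened)."""
    phis = [phi(x, y) for x, y in combinations(runs, 2)]  # k*(k-1)/2 unique pairs
    return np.mean(phis), np.var(phis)
```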
Validity
We evaluate the validity of the automated classification by comparing the assigned labels with human annotations. We rely on three popular and intuitive metrics to assess performance relative to human annotation (e.g. Powers, 2020). The calculation of these metrics is visualised in Figure 1, and a minimal computational sketch follows the list below.
a) Accuracy equals the overall percentage of agreement between the chatbot and human coders. It is calculated by dividing the number of true positives and true negatives by the total number of documents. It is important to emphasise that, in this context, accuracy means agreement with the human coder, who is also not flawless in interpreting true intents.

b) Precision reflects the amount of ‘noise’ in the documents classified as containing an argument. It is computed as the percentage of true positives amongst all chatbot positives; its complement is the share of false positives among the chatbot-labelled occurrences (the false discovery rate).

c) Recall shows how often the chatbot detects an argument in documents that contain this argument according to the human annotators. Recall is conceptually analogous to statistical power.
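The three metrics can be computed from binary vectors of human labels (`y_true`) and chatbot labels (`y_pred`); the sketch below is our illustration of these standard formulas, not the authors’ code:

```python
# Accuracy, precision and recall from binary human and chatbot labels.
import numpy as np

def classification_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, precision, recall
```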
Model parameters
Results are computed using the gpt-4-turbo-preview model. We set the model temperature to 20 per cent as suggested by, for example, Gilardi, Alizadeh and Kubli (2023, p. 3). The temperature influences the ‘randomness’ of word predictions: lower temperatures make the model more likely to select words with a high probability of occurring next in the sentence (e.g. Davis et al., 2024). In practice, low temperatures lead to more deterministic outcomes, sometimes described as more focussed and fact-based. For our purpose, setting a low temperature helps to obtain a ‘clean’ response from the model, that is, a set of classifications without any additional text. Documents are supplied in ‘batches’ of five, meaning that we repeat the prompt with five new documents until all documents have been processed.Footnote 3
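For readers who want to reproduce this setup, the sketch below shows how such batched requests could be sent with the OpenAI Python client. It is an assumed reconstruction rather than the authors’ published script; the prompt variables and the temperature value of 0.2 are assumptions based on the description above.

```python
# Illustrative batched classification via the OpenAI API (not the authors' script).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def classify_batch(documents, system_prompt, user_prompt_template):
    """Send one batch of five documents; return the raw model reply."""
    user_prompt = user_prompt_template.replace("[[documents]]", "\n".join(documents))
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        temperature=0.2,  # the paper reports '20 per cent'; 0.2 assumed here
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content  # HTML table of 0/1 classifications

def classify_corpus(docs, system_prompt, user_prompt_template, batch_size=5):
    """Repeat the prompt with five new documents until the corpus is processed."""
    return [classify_batch(docs[i:i + batch_size], system_prompt, user_prompt_template)
            for i in range(0, len(docs), batch_size)]
```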
Comparing approaches
We run the analysis with both the user interface (UI) and the application programming interface (API). The UI is a website accessed through a browser and, much like the Android application, is the way in which people normally interact with ChatGPT. The API provides back-end access to the model, typically used by application developers.Footnote 4 Via programming languages such as Python and R, we can send prompts directly to the servers of ChatGPT. The main advantage is the ability to automate requests, rather than inserting each batch manually. Additionally, the API has exclusive access to model parameters such as version, role and temperature settings. Thus far, no studies have compared the performance of the UI and the API. This is unfortunate since the UI is relatively cost-effective (we used a Plus subscription) and much easier to use for those with little programming experience. The main disadvantage of the UI is the effort spent manually submitting input and extracting output, as well as the message limit (at the time of the analysis: forty messages per 3 hours). The API, on the other hand, allows for prompt chaining (repeating a set of messages rather than sending one big message) and a flexible batch size (we supplied five documents per run). However, API access is technical to set up and can become expensive – especially for the later models – depending on the size of the corpus. At the time, one run of each dataset described above cost approximately 10 euros, at a rate of $0.01/$0.03 per 1,000 input/output tokens. Below we discuss the difference in performance between these access points.
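To illustrate how these rates translate into costs, with purely hypothetical numbers rather than figures from our runs: at $0.01 per 1,000 input tokens, a batch whose prompt totals 4,000 input tokens costs about $0.04 before output tokens; a corpus processed in roughly 100 such batches would therefore cost around $4 in input tokens alone, with output tokens adding comparatively little because the model returns only a small table of zeros and ones.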
Secondly, we implemented a ‘cut-off’ approach to try to reduce uncertainty in classifications. Wang et al. (2021) used a similar ‘few-shot’ approach in which classifications (in their case logit estimates) are repeated, finding that this outperforms single repetitions in terms of accuracy. We repeated the estimation five times, obtaining five argument classifications for each document. A document is then classified as containing an argument when it is classified as such in at least 3/5 repetitions. This approach mimics the method typically employed in logistic regression, where an outcome is predicted to be present when the predicted probability is above 50 per cent.
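A minimal sketch of this cut-off rule (our illustration, not the authors’ code; the array layout is an assumption) applies a majority vote over the five repetitions:

```python
# Majority-vote cut-off: an argument counts as present when >= 3 of 5 runs label it 1.
import numpy as np

def cutoff_classification(runs, threshold=3):
    """runs: array of shape (n_runs, n_documents, n_arguments) with 0/1 labels."""
    votes = np.asarray(runs).sum(axis=0)       # number of runs labelling each cell 1
    return (votes >= threshold).astype(int)    # final 0/1 classification per cell
```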
Results
Figure 2 shows the reliability of ChatGPT in classifying arguments in our two datasets. Overall, the classifications are quite reliable. Without distinguishing between arguments, classifications are correlated with φ = 0.71 for newspaper articles and φ = 0.84 for tweets. Results are thus somewhat more reliable for (shorter) tweets than newspaper articles.
The lower reliability for newspaper articles is due to greater variation both within and between arguments. Results are consistently less reliable for the argument contributions, with an average correlation between runs of φ = 0.67. Correlations can also vary within a single argument: for economic development, the lowest correspondence between runs is φ = 0.66 and the highest φ = 0.95. Especially for newspaper articles, then, we find that the chatbot may return somewhat different classifications between runs.
We now turn to the validity of the classifications in terms of accuracy, precision and recall. The total average performance of the chatbot compared with human annotators is presented in Figure 3. For the newspaper dataset, 78.7 per cent of all bot classifications match the human annotations. For the tweets dataset, accuracy is even higher: 91.7 per cent of all classifications are in agreement between humans and the chatbot. For interpretability, we report the average accuracy across replications, which is a reasonable summary given that variation between replications is less than 1 per cent. These accuracy scores are good, especially given the complexity of the task (Figure 3).
Total precision estimates, however, are underwhelming. For newspaper articles we find a precision rate of 41.3 per cent, meaning that 58.7 per cent of articles are false positives: they contain an argument according to the chatbot but not the human annotators. Classification of tweets is slightly less noisy: precision is 58.6 per cent. If we again take the human annotations as ground truth, then 41.4 per cent of chatbot-labelled occurrences are false positives or noise. If so, argument classifications from ChatGPT must be used cautiously.
Total recall values are not impressive but acceptable. For the newspaper dataset, we find that 60.1 per cent of all occurrences found by humans are also identified by the chatbot. Conversely, this means that 39.9 per cent of the human-labelled occurrences are false negatives, found by humans but not by the chatbot. Similarly, for the tweets dataset, we find that 58.6 per cent of arguments found by humans are also detected by the chatbot, with a corresponding false negative rate of 41.4 per cent. These total recall values illustrate that, while accuracy may be high, ChatGPT often misses occurrences of arguments that are found by humans. This may be due to the complexity of the task and the limited context provided in the argument description. Still, a detection rate of 60 per cent would be acceptable in some cases, especially when these are the most obvious occurrences.
Moreover, a part of the recall problem is due to considerable variation between arguments (see Figure 4). For both datasets, we find that some arguments are much more easily detected than others. For the arguments private capital and freeriding, for example, 80.3 per cent and 80.8 per cent, respectively, of human annotations are identified by the chatbot. In contrast, for economic development and security, only 24 per cent and 39.3 per cent of human labels are identified, respectively. We suspect this variation may be due to complex statements that refer to the argument, and perhaps due to the difficulty of devising a good argument description.
Comparing approaches
At first sight, the user interface (UI) performs slightly worse than the average API run, despite identical argument descriptions and near-identical prompts. Bear in mind, however, that the comparison is somewhat unstable because we did not perform multiple runs for the user interface. Multiple runs for the UI approach take time and effort but would reduce the degree of random variation in results. However, given that variation between runs is limited (as shown in Figure 2), we still believe this comparison is valid when it concerns large differences in performance (say, more than 5 percentage points).
Figure 5 shows the total accuracy, precision and recall per dataset. The upper and lower crossbars indicate the range in performance between arguments. Total accuracy is comparable between the API and UI methods for both datasets. Precision provides a mixed signal: the API yields more precise estimates on the tweets data (a difference of 8.5 percentage points), but less precise estimates on the newspaper data (a difference of 6.4 percentage points). Recall is somewhat better in the average API run compared with the UI run, but only for the newspaper dataset. Overall, we conclude that the API performs roughly on par with the UI.
Finally, we evaluate what we have called the cut-off method, where an argument is deemed present or absent when at least 3/5 repetitions agree. Interestingly, the method introduces a trade-off between precision and recall compared with the average API run. Accuracy is practically equivalent between the two methods, for both newspapers and tweets. In both datasets, and especially in the newspaper dataset, the cut-off approach substantially increases precision at the cost of reducing recall. In other words, using a cut-off value decreases the level of noise in the classification (i.e. fewer false positives) but loses detection power (i.e. more false negatives). In both datasets, the gain in precision almost exactly matches the loss in recall; that this happens in both datasets suggests it is not a coincidence. With this method, researchers have the option to reduce noise in the estimates when that is deemed more important than detecting all arguments present (Figure 6).
Discussion
This study evaluated the performance of ChatGPT (GPT-4 Turbo and GPT-4) as a tool to perform directed content analysis for policy debates on two very different data sources: Dutch tweets on Universal Basic Income and German newspaper articles on the Riester pension reform. Our results show relatively high levels of accuracy and reliability for the tweets dataset and – to a slightly lesser extent – for the newspaper articles. However, there are three main concerns when using ChatGPT to automate content analysis.
First, while overall results are positive, we show that bot-labelled argument occurrences contain a fair amount of noise (low precision) and that the chatbot fails to detect a good number of human-labelled occurrences (low recall). The primary reason for the discrepancy between accuracy on the one hand and precision and recall on the other is that the occurrence of arguments, much like the occurrence of hate speech, is naturally ‘imbalanced’. There are bound to be more documents that do not contain one specific argument than documents that do contain that argument, a situation that is only exacerbated when the number of arguments increases. Even the most adopted arguments in the UBI debate, for example, only occur in around 1–2 per cent of all tweets under investigation. Since accuracy measures the correct classification of both occurrences and non-occurrences, without any discrimination between the two, its value is highly determined by ChatGPT’s ability to correctly classify non-occurrences or ‘true negatives’. Precision and recall are better suited to evaluate the model’s ability to identify occurrences of arguments, because they essentially disregard the correct identification of non-occurrences (see the methods section for details). To avoid an overly optimistic evaluation of the method (Gilardi, Alizadeh and Kubli, 2023; Huang, Kwak and An, 2023; Wang et al., 2021; cf. Ziems et al., 2023), we therefore argue that it is vital to include statistics on precision and recall when validating such classification tasks.
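To illustrate with hypothetical numbers (not drawn from our data): in a corpus of 1,000 tweets in which an argument occurs 20 times, a classifier that flags 40 tweets and correctly identifies 12 of the 20 occurrences achieves an accuracy of (12 + 952)/1,000 = 96.4 per cent, yet only 12/40 = 30 per cent precision and 12/20 = 60 per cent recall.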
Second, since performance varies strongly between arguments, we suggest that the method is better suited for arguments with a clear description and a well-defined associated vocabulary in the documents it intends to classify. Interestingly, most performance indicators turned out to be better for the Twitter data than for the newspaper articles. This is surprising insofar as the ‘messiness’ of tweets and the well-crafted nature of newspaper articles could lead one to expect the opposite. Given that neither corpus was in English, we conclude that context length, rather than the nature of the text itself, might be the bigger problem for ChatGPT. Better results might thus be obtained by dividing newspaper articles and other policy documents into more ‘digestible’ chunks for ChatGPT. Alternatively, however, the variance in performance between arguments may also be caused by human mistakes in the coding process.
Third, when relying on language models, manual labelling of a subset of the debate will still be required to (a) identify arguments and their descriptions to establish a codebook and (b) validate the classifications generated by the chatbot. When more general topics suffice, a researcher may consider using topic models to find the most important arguments in the discussion. When arguments’ definitions are available and validation is of little concern, language model classification could be directly applied. For most applications, however, we recommend manually annotating a subset of the data. This subset can be used to build a grounded coding scheme and to evaluate the performance of the chatbot using the same techniques as elaborated in this article. For those interested in applying and developing this method, we have published the scripts and tweets data on the Open Science Framework for public use.Footnote 5
So, does ChatGPT herald the end of human annotators? While it is tempting to make such grand statements, the fair answer is no. First, humans are (at least for now, with the current models) better at content analysis than the most prominent large language model. Nonetheless, there is reason to believe that the models will continue to improve. Like Ziems et al. (2023, p. 21), we observed a substantial improvement in performance between GPT-4 Turbo (API) and GPT-4 (UI). Moreover, improvements were even more substantial between GPT-4 Turbo (API) and GPT-3.5 Turbo (API), especially in terms of precision (see Appendix C). This suggests that future models may equal or even surpass the current gold standard of human annotators. However, the realisation of such improvements will depend on the models’ future ability to avoid collapse when trained on recursively generated data (Shumailov et al., 2024).
Like any study, this one has its limitations, which point to avenues for future research. Our approach is limited to the automated application of a pre-defined coding scheme to identify arguments. The method is not supposed to generate codebooks or identify arguments without some prior definition. This also means that the challenge of automatically identifying actor positions towards arguments – that is, whether they agree or disagree with an argument – remains to be addressed in future research. We did not compare performance with methods such as topic modelling, specialised LLMs or competitors of ChatGPT. Haunss et al. (2020) and Lapesa et al. (2020) employed natural language processing (NLP) methods combining transformers and recurrent neural networks to predict arguments for policy discourse analysis. Ceron et al. (2024) report performance for these methods comparable to the results reported here. While prior studies suggest that GPT-4 performance on political texts is good, training and evaluating a dedicated language model for arguments in policy debates remains a valuable course of action.
We hope this study will serve as a starting point, providing baseline results against which future strategies for improvement can be assessed. Several strategies to further improve the performance of ChatGPT classifications are imaginable. First, variance in performance between arguments may indicate not just different levels of intrinsic difficulty of arguments, but potentially also heterogeneity in the set of descriptions. While we refrained from extensively tweaking prompts and argument descriptions to avoid ‘overfitting’ descriptions on the data, more detailed attention to prompt design and argument descriptions may further enhance performance in the future. One approach is ‘template refinement’, that is, reducing the number of arguments by grouping them into clusters (King et al., 2018; Nguyen-Trung, 2024). We also look to prompt engineering – that is, the specific formulation of instructions – a rapidly developing field seeking to optimise LLM performance through prompting techniques (e.g. Clavié et al., 2023; for an application to classification tasks, see Thomas et al., 2023). Fine-tuning models may also provide further improvements to classifications. Wang et al. (2021) found that fine-tuned LLMs outperform the basic GPT-3 model. Ziems et al. (2023) found that fine-tuned classifiers outperform GPT-4 on some types of text (e.g. misinformation) but not on political texts concerning ideology and stance. Considering the drastic improvements between GPT-3 and GPT-4, however, the main performance gains are likely to result from model updates.
Second, humans and LLMs are both imperfect in establishing the true intent of authors, and thus in establishing the ground truth. One way to reduce noise in classifications is to examine the misclassified documents, correct any human mistakes in the coding and rerun the classification procedure (e.g. Nguyen-Trung, 2024). This procedure provides insight into which arguments and texts are hard to classify, and simultaneously bolsters model performance. The extent to which this is effective depends on the number of flaws in the human coding; the performance gains from applying this technique remain to be investigated.
Third, while the comparison of two datasets shows the general ability of ChatGPT to classify policy debates, the differences in performance between datasets may be attributed to a range of factors. The two datasets differ in the type of argument, the political context, the length of the text, the language and the sampled fraction of documents, as well as potentially the technical complexity of the policy issue. A comparison with more cases is needed, however, to isolate the exact sources of the variation in results. Future research should therefore investigate which of these differences explain the differential performance.
As a final remark, while we see future potential, caution is warranted. Data privacy concerns as well as possible climate impacts (e.g. water usage; see Li et al., 2024) and potential political bias (McGee, 2023; Rozado, 2023) must not be neglected when using ChatGPT. Overall, however, we hope that future improvements in abilities will consolidate large language models as an important tool for social scientists.
Supplementary material
To view supplementary material for this article, please visit https://doi.org/10.1017/S0047279424000382
Competing interests
The author(s) declare none.