Content analysis is a valuable tool for analysing policy discourse, but annotation by humans is costly and time-consuming. ChatGPT is a potentially valuable tool for partially automating content analysis of policy debates, largely replacing human annotators. We evaluate ChatGPT’s ability to classify documents using pre-defined argument descriptions, comparing its performance with that of human annotators for two policy debates: the Universal Basic Income debate on Dutch Twitter (2014–2016) and the pension reform debate in German newspapers (1993–2001). We use both the API (GPT-4 Turbo) and the user-interface version (GPT-4) and evaluate multiple performance metrics (accuracy, precision and recall). ChatGPT is highly reliable and accurate in classifying pre-defined arguments across datasets. However, precision and recall are much lower and vary strongly between arguments. These results hold for both datasets, despite differences in language and media type. The cut-off method proposed in this paper may aid researchers in navigating the trade-off between detection and noise. Overall, we do not (yet) recommend a blind application of ChatGPT to classify arguments in policy debates. Those interested in adopting this tool should manually validate ChatGPT’s classifications before using them in further analyses. At least for now, human annotators are here to stay.
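To make the evaluation setup concrete, the sketch below illustrates the kind of pipeline the abstract describes: prompting GPT-4 Turbo via the OpenAI API with a pre-defined argument description and scoring the resulting labels against human annotations with accuracy, precision and recall. The prompt wording, example argument, placeholder data and helper names are illustrative assumptions, not the paper’s exact setup.

```python
# Minimal sketch (assumed prompt and parameters, not the paper's exact setup):
# classify documents against one pre-defined argument description with GPT-4 Turbo,
# then score the labels against human annotations.
from openai import OpenAI
from sklearn.metrics import accuracy_score, precision_score, recall_score

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical argument description from a codebook
ARGUMENT = "A universal basic income reduces bureaucracy in the welfare system."

def classify(text: str) -> int:
    """Ask GPT-4 Turbo whether the document contains the pre-defined argument (1) or not (0)."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,  # keep output as deterministic as possible for reliability checks
        messages=[
            {"role": "system", "content": "You are a content-analysis annotator."},
            {"role": "user", "content": f"Argument: {ARGUMENT}\nDocument: {text}\n"
                                        "Answer with 1 if the document uses this argument, otherwise 0."},
        ],
    )
    answer = response.choices[0].message.content.strip()
    return 1 if answer.startswith("1") else 0

# Placeholders for the annotated corpus (human-coded gold labels)
documents = ["example document one ...", "example document two ..."]
human_labels = [1, 0]

bot_labels = [classify(doc) for doc in documents]

print("accuracy :", accuracy_score(human_labels, bot_labels))
print("precision:", precision_score(human_labels, bot_labels, zero_division=0))
print("recall   :", recall_score(human_labels, bot_labels, zero_division=0))
```

Repeating such a per-argument comparison is one way to see the pattern the abstract reports: overall accuracy can be high while precision and recall differ sharply from one argument to another.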