Introduction
The assessment of receptive language skills in language tests is intricately linked to the consideration of input texts and their linguistic properties. Alongside item types and the proficiency level of a learner, the characteristics of the input texts have a significant impact on the difficulty of assessment tasks (Révész & Brunfaut, Reference Révész and Brunfaut2013; Toyama, Reference Toyama2021). Ensuring the comparability of input texts across different versions of examinations, particularly in the context of high-stakes standardized tests, is therefore crucial for test providers (Fitzgerald et al., Reference Fitzgerald, Elmore, Relyea, Hiebert and Stenner2016). Consequently, readability research focusing on the analysis of the linguistic complexity of input materials has garnered significant attention within the field of language testing, particularly concerning English (Chen & Sheehan, Reference Chen and Sheehan2015; Freedle & Kostin, Reference Freedle and Kostin1993).
The process of identifying and revising suitable authentic texts for assessment tasks is a pivotal step of item writing, yet it is challenging and time-consuming (Green & Hawkey, Reference Green and Hawkey2011; Salisbury, Reference Salisbury2005). With the increasing refinement of artificial intelligence (AI) tools based on large language models (LLMs), such as ChatGPT and Llama, generating input texts with the help of generative AI seems to be a promising option for developing test materials (Bolender et al., Reference Bolender, Foster and Vispoel2023; O’Sullivan, Reference O’Sullivan2023). It remains under-researched, however, whether LLMs can produce texts for specific assessment contexts that are of the same quality and linguistic complexity as non-AI texts selected and adapted for assessment purposes by professional test writers. Especially for German, research has yet to systematically explore the capabilities and limitations of LLMs in generating texts for assessment purposes.
While larger English-speaking testing institutions have begun developing customized test development engines powered by LLMs (Attali et al., Reference Attali, Runge, LaFlair, Yancey, Goodwin, Park and von Davier2022; Bolender et al., Reference Bolender, Foster and Vispoel2023), there is still limited research on the extent to which currently available LLMs can produce high-quality input texts that align with specific CEFR levels, particularly in languages other than English. Given that input texts play a crucial role in determining task difficulty, understanding the linguistic characteristics and limitations of AI-generated texts is essential – not only for test developers but also for broader language learning and teaching contexts. This study contributes to closing this research gap by systematically analyzing the linguistic properties of AI-generated texts in comparison to benchmark texts used in a high-stakes German language exam, with a focus on their suitability for assessing academic reading proficiency at the B2/C1 level.
Literature review
LLMs
Generative LLMs are a particular type of language model with the capability to both encode and decode human language. Due to massive amounts of training data and the complexity of the model architecture, they can handle various topics in many languages (Min et al., Reference Min, Ross, Sulem, Veyseh, Nguyen, Sainz and Roth2023). They learn about language from the co-occurrence of words in huge text corpora and can thus produce fluent, coherent, and mostly error-free texts (Adesso, Reference Adesso2023). However, LLMs tend to hallucinate, that is, they make up facts without indicating so (Alkaissi & McFarlane, Reference Alkaissi and McFarlane2023), and references, if given in a text, are often likewise made up and non-existent (Ray, Reference Ray2023).
LLMs work better for languages for which they have seen more training data and often show a bias towards a US-centric view (Feng et al., Reference Feng, Park, Liu and Tsvetkov2023). Although it is unknown what exactly the training data for ChatGPT looks like, one can estimate through comparisons with web content that it might have been exposed to about 10 times more content in English than in German (Petrosyan, Reference Petrosyan2024).
In the context of reading comprehension, LLMs have been used to produce simplified versions of authentic texts for their use in English-as-a-foreign-language settings (Young & Shishido, Reference Young, Shishido and Bastiaens2023) or reading comprehension tasks (Shin & Lee, Reference Shin and Lee2023; Xiao et al., Reference Xiao, Xu, Zhang, Wang and Xia2023), as detailed below.
Criteria of good input texts for assessment tasks
The routine of item writers encompasses not only the adaptation and revision of stimulus texts and the production of items but also the challenging process of text sourcing, which involves finding potential texts (Green & Hawkey, Reference Green and Hawkey2011; Salisbury, Reference Salisbury2005). In developing reading tasks for assessment purposes, it is essential to follow certain rules in the selection of written texts. A critical requirement for the texts is their relevance to and representativeness of the target use domain and/or alignment with the designated learning objectives (Chapelle & Lee, Reference Chapelle, Lee, Chapelle and Voss2021). In the context of evaluating reading proficiency for university admission purposes, it is, for example, imperative to employ texts reflective of those types of written texts that prospective students are likely to encounter during their studies, thereby ensuring content validity of the test (Green & Hawkey, Reference Green and Hawkey2011). While the information contained in the input texts should be understandable to non-experts, it should not be self-evident or common knowledge of the target group. Furthermore, the texts must align closely with the test specifications, which include, among other things, the assessed level of language proficiency, the target group, the genre, style, length, and subject matter of the text (Brunfaut, Reference Brunfaut, Fulcher and Harding2021).
The potential suitability of a text to facilitate the generation of specific item types to measure certain constructs should also be considered. For instance, from the perspective of a test construct, the ability to recognize text structure and overarching causal relationships is evaluated by requiring test-takers to arrange sections of a text in the correct order. Such a text, therefore, must be divisible into clear sections marked by appropriate connectors. Additionally, from the standpoint of item format, if multiple-choice questions are used, the text should encompass a breadth of information points, which are important for the development of plausible distractors (Brunfaut, Reference Brunfaut, Fulcher and Harding2021).
Another important aspect regarding the suitability of a text as an input text is its complexity. This line of research has so far been conducted primarily for English tests (Chen & Sheehan, Reference Chen and Sheehan2015; Freedle & Kostin, Reference Freedle and Kostin1993). For example, using TextEvaluator, Chen and Sheehan (Reference Chen and Sheehan2015) compared the stimulus material of TOEFL Primary, Junior, and iBT according to eight groups of features referring to syntactic complexity, vocabulary difficulty, academic orientation, argumentation, concreteness, cohesion, degree of narrativity, and style. The calculated ranges of scores, derived from the distributions of overall complexity as well as the component scores for each passage at various test levels, are used as benchmarks for the development or selection of new passages.
For the German language, there are several developmental projects in the area of German language learning that aim at assigning a CEFR level to German reading texts, for example, the DaFLex project, the CEFRSERV project as part of the European Language Grid (Rehm et al., Reference Rehm, Berger, Elsholz, Hegele, Kintzel, Marheinecke and Klejch2020) or the Level-Adequate Texts in Language Learning Project (Vázquez-Ingelmo et al., Reference Vázquez-Ingelmo, García-Holgado, Therón, Shoeibi, García-Peñalvo, Saltiveri, Veloso, Navarro, González, Cairol, Solé and Gomà2023). However, to the best of our knowledge, there is no published research in the context of language assessment for the German language that compares human and AI-generated input texts for different CEFR proficiency levels.
LLMs and generation of reading assessment materials
Despite the expanding body of work on using AI for item generation (Bolender et al., Reference Bolender, Foster and Vispoel2023; Pugh et al., Reference Pugh, De Champlain, Gierl, Lai and Touchie2020), little empirical research in language testing has so far focused on the capability of generative AI to produce high-quality input texts for specific assessment purposes. In an EFL context, Attali et al. (Reference Attali, Runge, LaFlair, Yancey, Goodwin, Park and von Davier2022) investigated the quality of reading passages and items, which were generated using the GPT-3 model family and few-shot conditioning, through psychometric characteristics of the items as well as content and fairness reviews. While the authors outlined the criteria underpinning the evaluation of the LLM-generated test material, namely content appropriateness, cohesion, clarity, and logical consistency, and reported that a total of 58% of passages were retained as a result of “all reviews and adjudication,” they did not explicitly specify the shortcomings of the AI-generated passages that led to the exclusion of the remaining passages. Furthermore, and crucially, the study neither included benchmark texts as the basis for the analysis nor performed a linguistic analysis of the textual features. Additionally, it is noteworthy that the longest passages produced in their test context were capped at 175 words. This limitation is particularly relevant when compared to the longer texts usually employed in language admission exams within the German context, suggesting an area for further exploration.
Two other studies, each focusing on the English language as well, have contributed to our understanding of the capabilities of LLMs in generating reading comprehension test items by comparing them with benchmark texts. Specifically, this research examined the applications within the English section of the College Scholastic Ability Test in South Korea, as reported by Shin and Lee (Reference Shin and Lee2023), and in the context of middle school English learning in China, as detailed by Xiao et al. (Reference Xiao, Xu, Zhang, Wang and Xia2023). Despite utilizing different LLMs – Shin & Lee employed ChatGPT-3.5, while Xiao et al. evaluated a range of content generation models including a fine-tuned GPT-2, ChatGPT in a zero-shot configuration, and ChatGPT in a one-shot scenario – both studies discovered that the LLM-generated texts were of a quality that was at least on par with, if not superior to, existing human-authored materials. For several reasons, the findings from these studies may not be directly applicable to our research context. Not only does our study focus on a different language and target proficiency, but it also takes a different methodological approach.
Shin and Lee’s (Reference Shin and Lee2023) human evaluation focused on only two aspects of the quality of the generated texts (the natural flow of the passages and the naturalness of the English expressions) and did not incorporate any computational analysis. In contrast, Xiao et al.’s (Reference Xiao, Xu, Zhang, Wang and Xia2023) approach can be praised for its robust design, as it combined human and computational analyses. Their computational analysis focused on five features: negative log-likelihood loss, SMOG and Flesch grade, type-token ratio (TTR), and the proportion of repeated n-grams. Additionally, human evaluation assessed five aspects of the generated texts: readability, correctness, coherence, engagement, and overall quality. However, the study did not include several features relevant to academic reading passages, such as alignment with the target genre or analysis of syntactic complexity. Moreover, both studies relied on earlier versions of ChatGPT available at the time of their research.
In her discussion of key validity issues for using generative AI in test development, Xi (Reference Xi2023) formulated an important question that test users should ask and test providers should answer, namely “is there research evidence that shows AI-generated test content edited by trained test developers can emulate the quality of test content created by human developers entirely?” (p. 369). Our exploratory study seeks to address a closely related question, namely how AI-generated material compares with content created by test developers based on non-AI resources, such as online magazines and other websites. This is done in the context of a reading comprehension task of a German exam for university admission purposes through the systematic analysis of the generated texts. In doing so, our study addresses a preliminary step towards answering the question posed by Xi. By examining the differences between these two types of material, we hope to identify those text features which test developers should pay attention to when editing AI-generated content for testing purposes. In this study, we focus on longer input texts. We combine a detailed computational analysis using a wide variety of linguistic features with a human review and original assessment texts as a benchmark.
Purpose of the study and research questions
The study aims to investigate the potential of generative AI to produce high-quality input texts for reading comprehension by exploring the nuances of AI language production in comparison to human test development, focusing on two main research questions (RQs):
RQ1: How do AI-generated texts and texts created by human developers (benchmark texts) differ in their linguistic features?
RQ2: What differences do experienced test developers see between the two text types?
Methodology
Study context and target task
The study was conducted in the context of TestDaF, a standardized language test for university admission purposes in Germany (Norris & Drackert, Reference Norris and Drackert2018). Since its primary purpose is to determine whether a candidate’s German language proficiency in four language skills is sufficient for participation in German university studies, the exam includes authentic tasks and texts similar to those students encounter at the start of their studies, for example, in a university course or on campus.
The reading section of the paper-based TestDaF comprises three tasks (for a sample test, see g.a.s.t., TestDaF-Institut, 2020). Task 2, chosen for this study, is based on a reading passage of around 500 words with 10 multiple-choice items, and its level corresponds to levels B2.2/C1.1 of the CEFR. The reading passages for this task consist of a report on an academic topic, outlining either a scientific study or the latest academic findings on a scientific phenomenon. They are based on texts taken from popular science magazines or the websites of universities and scientific institutions. The multiple-choice items test the understanding of main ideas as well as detailed information from the input text; the last item is a “macro” item relating to the text as a whole.
Data
Benchmark texts
A total of 30 input texts for reading comprehension Task 2 written by experienced human test writers and used in the TestDaF exams were randomly selected for this study and served as a benchmark.
AI-generated texts
Based on the test specifications provided for item writers, we generated 30 texts on the topics of the benchmark texts with each of ChatGPT-3.5 and ChatGPT-4, using the following prompt: Write a report about a scientific study on the state of the art of research about “a name-of-a-specific-phenomenon” with many details. The text should be between 450 and 550 words long and must not include any lists or headings. This prompt was the result of prior try-outs with different prompts. For example, we initially did not specify that the text should not contain any lists or headings. However, after generating texts on five topics, it became apparent that the output frequently featured excessive lists and multiple subheadings, rendering the texts less suitable for task creation due to their overly detailed organization. The prompt was provided in German without further specifications; we did not provide the target CEFR levels, since previous research has shown that LLMs do not have accurate knowledge of the CEFR (Benedetto et al., Reference Benedetto, Gaudeau, Caines and Buttery2025).
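For illustration, the sketch below shows how such a batch of texts could be requested via the OpenAI Python client. It is a minimal sketch only: the German prompt is an approximate rendering of the prompt described above rather than the exact wording used in the study, the topics are placeholders, and the model identifiers and API setup (an API key in the OPENAI_API_KEY environment variable) are assumptions.

```python
# Minimal sketch of the generation step with the OpenAI Python client.
# Assumptions: `pip install openai`, API key in OPENAI_API_KEY, and the model
# identifiers below. The German prompt is an illustrative rendering of the
# prompt described in the text; the topics are placeholders, not the topics
# used in the study.
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = (
    "Schreibe einen Bericht über eine wissenschaftliche Studie zum aktuellen "
    "Forschungsstand zu \"{topic}\" mit vielen Details. Der Text soll zwischen "
    "450 und 550 Wörtern lang sein und darf keine Listen oder Überschriften "
    "enthalten."
)

topics = ["Walstrandungen", "Carsharing in Großstädten"]  # placeholder topics

def generate_text(topic: str, model: str = "gpt-4") -> str:
    """Request one candidate input text for a given topic."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(topic=topic)}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    for topic in topics:
        print(generate_text(topic)[:200], "...")
```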
A total of 90 texts belonging to three text types were used for the analysis in this study as presented in Figure 1.

Figure 1. Illustration of the research design.
Computational analysis of the texts
Analyzed features
In our comparison of the three different text types, we used a set of linguistic features (N = 61) at different levels of linguistic granularity (see Table 1 for an overview). Features from these categories were identified as indicators of linguistic complexity in previous work (see, for example, Hancke et al., Reference Hancke, Vajjala and Meurers2012; Weiß, Reference Weiß2024; Weiss & Meurers, Reference Weiss and Meurers2018). We explored a broad variety of textual features to identify possible differences between the three text types. We included features clearly related to important characteristics of the task at hand, such as syntactic complexity, breadth of vocabulary, or style (nominal vs. verbal), making sure that the target construct is sufficiently covered.
Table 1. Overview of the analyzed features per category

Traditional readability measures
We used several traditional readability measures originally developed for English, which have been applied to German (Hancke et al., Reference Hancke, Vajjala and Meurers2012), as well as the Wiener Sachtextformel (Bamberger & Vanecek, Reference Bamberger and Vanecek1984), a readability metric specifically designed for German. Most readability formulas combine word complexity, often approximated through the number of syllables or characters per token, with sentence complexity, often measured by sentence length in tokens.
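As a concrete illustration of how such formulas combine word-level and sentence-level information, the following minimal sketch computes the Automated Readability Index (ARI), one of the measures reported below, from raw text. The naive regex-based tokenization is a simplifying assumption; the study obtained these measures through the LiFT toolkit, whose implementation may differ in detail.

```python
# Minimal sketch: Automated Readability Index (ARI) from raw text.
# Standard formula: 4.71 * (characters per word) + 0.5 * (words per sentence) - 21.43.
# Tokenization is a naive regex split, for illustration only.
import re

def ari(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    characters = sum(len(w) for w in words)
    return (4.71 * (characters / len(words))
            + 0.5 * (len(words) / len(sentences))
            - 21.43)

sample = ("Die Studie untersucht den aktuellen Forschungsstand. "
          "Sie fasst zentrale Ergebnisse mehrerer Untersuchungen zusammen.")
print(round(ari(sample), 2))
```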
Lexical features
Our lexical features measured vocabulary breadth using a corrected variant of TTR. TTR in its basic form is known to be dependent on text length (Koizumi & In’nami, Reference Koizumi and In’nami2012); we therefore used the Moving Average TTR (MATTR, Covington & McFall, Reference Covington and McFall2010), in which equally-sized segments (100 tokens in our case) are repeatedly extracted from a text, and the average over the individual TTR values for these segments is used to model lexical variation. We further computed TTR for individual part-of-speech (POS) classes as well as lexical variation, that is, TTR for content words (nouns, verbs, adjectives) only, and measured lexical density as the relative frequency of content words among all words (see also Lu, Reference Lu2014, p. 80).
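A minimal sketch of the MATTR computation described above is given below: a window of fixed size slides over the token sequence, TTR is computed within each window, and the window values are averaged. The whitespace/regex tokenization and the fallback to plain TTR for texts shorter than the window are simplifying assumptions; the study computed this feature with the LiFT toolkit.

```python
# Minimal sketch of MATTR with a 100-token window, using a naive regex
# tokenizer; texts shorter than the window fall back to plain TTR.
import re

def mattr(text: str, window: int = 100) -> float:
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ttrs = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)
```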
POS features
Measuring the distribution of individual POS, such as verbs, nouns, and adjectives, allows inferences about the balance between nominal and verbal style in a text, the frequency of subordinate clauses (introduced by relative pronouns or conjunctions), and the use of filler words such as particles. It also provides information about the usage of pronouns in comparison to common nouns or proper nouns. For this group of features, we evaluated the relative frequency of individual POS tags. As tag set, we used the coarse-grained Universal Dependency Tagset (Petrov et al., Reference Petrov, Das and McDonald2012) with 12 different POS tags, of which we included 10 in our analyses.
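To illustrate this group of features, the sketch below computes relative POS-tag frequencies with spaCy’s German pipeline, which also exposes coarse universal POS tags. spaCy and the de_core_news_sm model are assumptions made for this stand-alone example; the study itself used the LiFT/DKPro pipeline, so tag inventories and counts may differ slightly.

```python
# Illustrative relative POS-tag frequencies with spaCy's German pipeline
# (coarse universal POS tags via token.pos_); punctuation and whitespace
# tokens are excluded from the counts.
from collections import Counter
import spacy

nlp = spacy.load("de_core_news_sm")

def pos_frequencies(text: str) -> dict:
    doc = nlp(text)
    tokens = [t for t in doc if not t.is_punct and not t.is_space]
    counts = Counter(t.pos_ for t in tokens)
    return {tag: n / len(tokens) for tag, n in counts.items()}
```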
Morphological features
Morphological features measure aspects of word formation. We analyzed the texts using the Mate Morphological Tagger (Björkelund et al., Reference Björkelund, Bohnet, Hafdell and Nugues2010) and counted the relative frequency of nouns in different cases (nominative, genitive, dative, and accusative) among all nouns. As nominalizations play an important role in our text genre, we measured the frequency of different derivational suffixes used to form nouns in German (such as “-heit,” “-keit,” “-ung”) in relation to the overall number of nouns (see Hancke et al., Reference Hancke, Vajjala and Meurers2012). We further computed the ratio of finite verbs among all verbs (as a proxy for the complexity of the verb phrase) and the frequency of passive constructions.
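The suffix-based nominalization feature can be approximated as in the short sketch below: the share of nouns ending in common German derivational suffixes, relative to all nouns. The suffix list is a simplified, illustrative subset, and noun identification via spaCy is a stand-in for the Mate-tagger-based pipeline used in the study.

```python
# Illustrative nominalization-suffix ratio: the share of nouns ending in
# common German derivational suffixes, relative to all nouns. The suffix
# list is a simplified subset chosen for illustration.
import spacy

nlp = spacy.load("de_core_news_sm")
SUFFIXES = ("heit", "keit", "ung", "schaft")  # simplified, illustrative subset

def nominalization_ratio(text: str) -> float:
    nouns = [t.text.lower() for t in nlp(text) if t.pos_ == "NOUN"]
    if not nouns:
        return 0.0
    derived = sum(1 for n in nouns if n.endswith(SUFFIXES))
    return derived / len(nouns)
```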
Syntactic features
Our syntactic features model the grammatical complexity of sentences in a text. We measured the maximal and average depth of parse trees, which were obtained from the CoreNLP Parser (Manning et al., Reference Manning, Surdeanu, Bauer, Finkel, Bethard and McClosky2014), for individual sentences in a text with more deeply nested sentences receiving higher values (Chen & He, Reference Chen, He, Yarowsky, Baldwin, Korhonen, Livescu and Bethard2013). We further computed the average number of relative clauses per sentence, the number of infinitive clauses per sentence, and the number of subordinate clauses introduced by a subordinate conjunction. Table 1 gives an overview of the analyzed features per category.
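As a rough analogue of the parse-depth features, the following sketch measures the depth of dependency trees per sentence with spaCy. This is an assumption made for illustration: the study measured the depth of constituency parse trees obtained from the CoreNLP parser, a related but not identical notion of syntactic nesting.

```python
# Rough analogue of the parse-depth features: per-sentence dependency-tree
# depth with spaCy (the study used CoreNLP constituency trees instead).
import spacy

nlp = spacy.load("de_core_news_sm")

def token_depth(token) -> int:
    depth = 0
    while token.head is not token:  # spaCy roots are their own head
        token = token.head
        depth += 1
    return depth

def avg_and_max_depth(text: str) -> tuple:
    doc = nlp(text)
    depths = [max(token_depth(t) for t in sent) for sent in doc.sents]
    return sum(depths) / len(depths), max(depths)
```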
Text processing
All texts were linguistically processed using the LiFT toolkit (Zesch et al., Reference Zesch, Horbach, Weiss, Aggarwal, Bewersdorff, Bexte and Westphal2021). This Java-based toolkit makes use of various NLP preprocessing components provided through DKPro Core (Eckart de Castilho & Gurevych, Reference Eckart de Castilho and Gurevych2014), such as tokenization, POS tagging, lemmatization, and parsing, and integrates the individual feature extractors in a UIMA pipeline (Ferrucci & Lally, Reference Ferrucci and Lally2004).
Statistical analysis
We computed descriptive statistics for the features grouped by category. To test for statistically significant differences in individual features across the three text types, we employed the Kruskal–Wallis test (df = 2). This non-parametric method serves as an alternative to the analysis of variance when the assumptions of normal distribution and homogeneity of variance are not met, making it particularly suited to our dataset, which comprises frequency counts; Levene’s test confirmed that the variances were indeed unequal. For pairwise comparisons between the three text types, we applied Mann–Whitney U-tests as post-hoc analyses.
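A minimal sketch of this testing procedure with SciPy is shown below for a single feature; the group values are made-up placeholders, as the actual analysis was run over all 61 features and the full set of 90 texts.

```python
# Sketch of the statistical procedure for one feature (placeholder values).
from scipy import stats

benchmark = [0.44, 0.46, 0.45, 0.43, 0.47]   # placeholder feature values
chatgpt35 = [0.48, 0.49, 0.47, 0.50, 0.48]   # placeholder feature values
chatgpt4  = [0.50, 0.51, 0.49, 0.52, 0.50]   # placeholder feature values

# Levene's test for homogeneity of variance, then the omnibus Kruskal-Wallis test.
print(stats.levene(benchmark, chatgpt35, chatgpt4))
print(stats.kruskal(benchmark, chatgpt35, chatgpt4))

# Post-hoc pairwise comparisons with Mann-Whitney U-tests.
pairs = [("benchmark vs GPT-3.5", benchmark, chatgpt35),
         ("benchmark vs GPT-4", benchmark, chatgpt4),
         ("GPT-3.5 vs GPT-4", chatgpt35, chatgpt4)]
for label, a, b in pairs:
    u, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    print(label, u, p)
```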
Human review
To enrich the computational quantitative analysis with an additional qualitative perspective, we asked two experienced TestDaF team members with extensive expertise in assessment and item writing within the relevant context to assess a subset of the texts used in the automated analysis. Each of them evaluated 20 texts, comprising 10 generated by ChatGPT-4 and 10 benchmark TestDaF texts, against a set of predefined criteria. The human reviewers were unaware of the study’s objectives and of the fact that half of the texts were AI-generated. In total, they provided 40 evaluations – 20 for the AI-generated texts and 20 for the benchmark texts – against the specified criteria regarding vocabulary (Qs 1–2), syntactic complexity (Q3), and content/genre (Qs 4–7). They were also asked to identify any peculiarities within the texts and to provide specific examples of those (see Appendix 1 for the questionnaire). We conducted a Mann–Whitney U-test to investigate whether the ratings for the two text types differed statistically. Responses to the open-ended question were manually categorized, with particular attention paid to repeated mentions of specific text features.
In addition, a third expert responsible for testing receptive skills at the TestDaF-Institute undertook a thorough examination of 30 texts generated by ChatGPT-4. In particular, this reviewer analyzed the texts regarding the correspondence to the target CEFR level as well as the known problems of LLM-generated texts such as biases and hallucinations.
Figure 1 gives an overview of the data and its analyses.
Results
In this section, we present a detailed comparison of linguistic features across the three text types employing computational analysis and the human evaluation of a sample of texts.
Computational analysis of texts
The results of the computational analysis of texts will be presented according to the feature categories analyzed.
Readability measures
The descriptive statistics provided in Appendix 2 (Table 2.1), along with the box-plot illustrations in Figure 2 depicting a selection of readability indices, demonstrate that texts generated by both versions of ChatGPT exhibit higher scores, that is, are more complex, than the benchmark texts. With the exception of the Coleman–Liau index, the observed differences in readability scores between the three text types are statistically significant, as shown by the pairwise significance tests in Table 2.3 in Appendix 2. The biggest difference in the post-hoc Mann–Whitney U-tests for readability measures between the benchmark texts and the ChatGPT-4 texts was found for the ARI index (U = −56.667, p < .001).

Figure 2. Box-plots for five readability measures for the three text types.
Lexical measures
The descriptive statistics provided in Appendix 3, along with the box-plot illustrations in Figure 3, indicate that benchmark texts tend to have a higher MATTR than both types of ChatGPT texts (Mbenchmark = 0.76; MChatGPT-3.5 = 0.70; MChatGPT-4 = 0.72), with the differences being statistically significant as shown in the post-hoc Mann–Whitney U-tests in Table 3.3 in Appendix 3. Regarding lexical variation, benchmark texts tend to be closer to ChatGPT-4 texts, but the texts still differ significantly on this measure. At the same time, ChatGPT texts tend to have more content words relative to all words and are thus more lexically dense, as seen in the mean comparisons and significance tests in Tables 3.1 and 3.3 in Appendix 3 (Mbenchmark = 0.45; MChatGPT-4 = 0.50). In examining lexical density across text types, Mann–Whitney U-tests revealed significant differences between the benchmark and ChatGPT-4 texts (U = −37.400, p < .001), as well as between the benchmark and ChatGPT-3.5 texts (U = −45.800, p < .001).

Figure 3. Box-plots for lexical measures for the three text types.
POS analysis
As visualized in the box-plots in Figures 4 and 5 and reported in the Kruskal–Wallis test results, significant differences in the relative frequency of individual POS tags between the three text types were found for all parts of speech except prepositions.

Figure 4. Box-plots for relative frequency of content words for the three text types.

Figure 5. Box-plots for relative frequency of function words for the three text types.
In particular, as subsequent post-hoc Mann–Whitney U-tests (Table 4.3 in Appendix 4) showed, ChatGPT-4 texts tend to contain significantly more adjectives (U = −41.067, p < .001), more conjunctions (U = −30.833, p < .001), more determiners (U = −38.067, p < .001), and more nouns (U = −45.883, p < .001) than benchmark texts. At the same time, benchmark texts tend to have more adverbs, more numerals, more pronouns, more proper nouns, and more verbs (see Table 4.3 in Appendix 4).
Morphological complexity
In the investigation of 25 morphological features encompassing various dimensions of word formation related to verbs and nouns, the analysis revealed significant differences for several measures. Notably, as illustrated in Figure 6 and corroborated by the statistical evidence in Appendix 5, the finite verb ratio and the prevalence of passive sentences were significantly higher in the benchmark texts than in those generated by ChatGPT-4. This difference was substantiated by post-hoc Mann–Whitney U-tests, which yielded U = 30.683, p < .001 for the finite verb ratio and U = 21.200, p = .002 for the frequency of passive sentences. By contrast, ChatGPT texts of both types tend to contain more nominalizations, as measured by the feature “frequency of all suffixes” (Mbenchmark = 0.17; MChatGPT-3.5 = 0.39; MChatGPT-4 = 0.41), with the differences being statistically significant as seen in the Mann–Whitney U-tests in Table 6.3 in Appendix 6.

Figure 6. Box-plot for morphology features for the three text types.
The analysis of case usage across three text types revealed distinct patterns: ChatGPT-generated texts utilize more nouns in the genitive and accusative cases, while benchmark texts contain more instances of the nominative case. The most pronounced difference was noted in the genitive case, as confirmed by the Kruskal–Wallis test (H = 20.067, p < .001). No significant differences were observed in the usage of the dative case.
Syntactic complexity
Significant differences between the three text types were also found for the measures of syntactic complexity. In particular, ChatGPT-4-generated texts tend to have, on average, a more deeply nested sentence structure than benchmark texts and ChatGPT-3.5-generated texts (Mbenchmark = 5.08; MChatGPT-4 = 5.89; U = −39.350, p < .001). At the same time, the most complex sentence in the benchmark texts is on average more deeply nested than the most complex sentence in the AI-generated texts (Mbenchmark = 11.33; MChatGPT-3.5 = 8.03; MChatGPT-4 = 9.60), as confirmed by the Mann–Whitney U-tests in Table 6.3 in Appendix 6.
When comparing the average number of different types of clauses per sentence, as visualized in Figure 7 and summarized in Tables 6.1–6.3 in Appendix 6, ChatGPT-4-generated texts have the highest proportion of all three types of clauses: subordinate clauses, infinitive clauses, and relative clauses. Furthermore, ChatGPT-generated texts have a higher average number of connectives per sentence (Mbenchmark = 1.24; MChatGPT-3.5 = 1.55; MChatGPT-4 = 1.92), with Mann–Whitney U-tests showing statistically significant differences between the three text types at p < .003.

Figure 7. Box-plot for syntactic features for the three text types.
Human review
The analysis of human evaluations (see Table 2) revealed that ChatGPT-4-generated texts received higher scores across the two linguistic categories (vocabulary and syntax), suggesting they are perceived as more complex by experts; however, these differences were not statistically significant. Regarding content, benchmark texts were found to contain more examples illustrating the described phenomena (Mbenchmark = 1.95, SD = 0.61) than ChatGPT-4-generated texts (MChatGPT-4 = 0.9, SD = 0.55) on a scale from 0 to 4 (U = 127, p < .05). Furthermore, evaluations of text coherence, the origin of the texts, and content predictability revealed no statistically significant differences between benchmark and ChatGPT-4-generated texts.
Table 2. Human reviewers’ mean ratings of two text types

The analysis of the comments in the open-ended questions revealed further differences between the two text types. Human reviewers pointed to cases where expressions were not idiomatic and appeared to be direct translations from English, for example, Rollenmodelle (“role models”) and nicht-CO2-Effekte (“non-CO2 effects”). For these terms, more idiomatic German expressions exist (Vorbilder; Effekte, die nicht durch CO2 verursacht werden). In some cases, the texts used English expressions (e.g., “ride-hailing,” “cetacean stranding”) or technical terms that should ideally be avoided or at least explained, for example, Sonar-Navigationssysteme (“sonar navigation systems”) and geomagnetische Anomalien (“geomagnetic anomalies”). Furthermore, the reviewers noted the large number of nominalizations in ChatGPT-generated texts and their rigid structure.
A third expert in test development conducted a qualitative analysis of 30 ChatGPT-4-generated sample texts, focusing on CEFR level alignment, bias, and hallucinations. The expert found a good correspondence with the B2.2/C1.1 levels and no blatantly incorrect information. Fact-checking confirmed the accuracy of content, which provided informative overviews accessible to readers from diverse academic backgrounds. However, the expert identified some weaknesses. First, all ChatGPT-4 texts followed the same general structure. A short introduction lists the main ideas that are to be discussed in the main body of the text. The texts conclude with a brief summary of the main ideas and a rather generic outlook on the future relevance of the topic. In contrast, the benchmark texts showed more structural variety. In addition, the AI-generated texts tended to list information rather than explain causal relationships. In some cases, especially in scientific outlines, the texts lacked the necessary detail, focusing on general information rather than specific study findings. Yet, when reporting recent scientific phenomena, these discrepancies were absent.
Discussion
This article delves into the role of LLMs in language testing and contributes to the understanding of the nuanced interplay between input texts, linguistic features, and the efficacy of LLMs. Specifically, the study explored the comparability between benchmark texts employed in a high-stakes German examination and LLM-generated texts, addressing a context that has been hitherto underrepresented in the domain of language testing research – that of German language exams.
The readability analysis revealed that ChatGPT-4 texts are more complex than those generated by ChatGPT-3.5 and benchmark texts, with longer sentences and words potentially posing increased cognitive demands on readers. Lexical analysis showed that, while benchmark texts have a broader lexical diversity across all parts-of-speech, ChatGPT-4 texts show greater variation in content words and a higher lexical density, indicating a denser concentration of information. Human reviewers noted no significant differences in vocabulary breadth but pointed out the AI-generated texts’ use of technical and infrequent vocabulary.
In terms of parts-of-speech, it was found that ChatGPT-4 texts contain significantly more adjectives, conjunctions, determiners, and nouns, whereas benchmark texts have higher frequencies of adverbs, numerals, pronouns, proper nouns, and verbs. Morphologically, benchmark texts show a higher use of finite verbs and passive constructions, while ChatGPT texts have more nominalizations. Notably, ChatGPT texts contain more genitive and accusative cases, adding complexity to the text. Syntactically, ChatGPT-4 texts are characterized by deeper nested structures and a higher use of various clauses and connectives, compared to benchmark and ChatGPT-3.5 texts. Despite the more rigid structure of AI-generated texts, human reviews evaluated their coherence as similar to that of benchmark texts.
Our findings suggest that AI-generated texts can be used as a starting point for creating new input texts for Reading Task 2 of the paper-based TestDaF, especially for those instances of Task 2 that describe the state of the art of research on a scientific phenomenon. However, these texts should always be refined by experts. The differences in language, structure, and content revealed by the analyses provide clear targets both for the refinement of the AI-generated texts during the revision stage and for the refinement of the prompt used. For example, linguistically, the texts have to be checked for English expressions and technical terms as well as non-idiomatic expressions resulting from inadequate translations from English, a finding that is to be expected given that ChatGPT was trained primarily on English data (Zhang et al., Reference Zhang, Li, Hauer, Shi and Kondrak2023) and tends to use English as its internal pivot language (Wendler et al., Reference Wendler, Veselovsky, Monea and West2024). Also, passages with clusters of nominalizations would have to be rephrased using verbs and subordinate clauses, and, as the higher readability indices indicate, AI-generated texts might benefit from some simplification. The rigid structure observed in our data would also need to be varied. Most importantly, however, the texts require more detailed information and examples in order to provide sufficient material for writing plausible and unambiguous multiple-choice items (Brunfaut, Reference Brunfaut, Fulcher and Harding2021).
Contrary to concerns raised in the literature, the analysis revealed that the texts neither provided false information (Alkaissi & McFarlane, Reference Alkaissi and McFarlane2023) nor showed bias (Feng et al., Reference Feng, Park, Liu and Tsvetkov2023); however, the references they included, primarily to scientific studies, were made up (Ray, Reference Ray2023). Both aspects would need to be checked when using AI-generated texts.
The need for such revisions could be reduced by adapting the prompt. For instance, the prompt could ask for more details and relevant examples, and the requested text length could be increased in order to generate more material from which non-suitable sections could be removed. Furthermore, few-shot prompting, which involves providing an LLM with a few example texts to guide its generation, might improve text quality. By providing an LLM with samples of benchmark texts and asking it to produce similar texts on a different topic, the generated texts will likely have more of the desired features than texts generated by zero-shot prompting (see the sketch below). However, it remains to be explored whether these revised prompts would have the desired effects, since it has been shown that LLMs “cannot leverage (the information about the CEFR) to accurately perform educational tasks” (Benedetto et al., Reference Benedetto, Gaudeau, Caines and Buttery2025, p. 13). Furthermore, in order to ensure test security, this method would only be suitable for use with customized LLMs or the workspace version of ChatGPT. In any case, a systematic follow-up analysis and evaluation of the texts by first-language speakers is necessary.
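The following sketch illustrates the few-shot setup just described, again using the OpenAI Python client as an assumed interface. The example text and topic are placeholders only; as noted above, real benchmark texts should only be passed to a model within a deployment that satisfies test-security requirements.

```python
# Sketch of a few-shot prompt for text generation (OpenAI Python client assumed).
# EXAMPLE_TEXT and the topic are placeholders; real benchmark texts should only
# be shared with a model under appropriate test-security arrangements.
from openai import OpenAI

client = OpenAI()
EXAMPLE_TEXT = "..."  # placeholder for a benchmark-style sample text

messages = [
    {"role": "system",
     "content": "Du schreibst Lesetexte für einen standardisierten Sprachtest."},
    {"role": "user",
     "content": "Hier ist ein Beispieltext:\n\n" + EXAMPLE_TEXT},
    {"role": "user",
     "content": ("Schreibe einen ähnlichen Text zu einem anderen Thema "
                 "(450-550 Wörter, keine Listen oder Überschriften).")},
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)
```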
When considering the implications of the differences found, we have to differentiate between those that can directly inform or instruct item developers on how to change a text and those that point at small differences that might have little relevance in practice even if they are statistically significant. A finding that an AI-generated text has 50% more adverbs than a human-written text could lead to specific revisions of the generated text. In contrast, a slightly higher TTR value of a specific text type, though statistically significant, might not require immediate action.
The research reported in this article builds upon prior studies on AI-generated content suitability for test purposes (Attali et al., Reference Attali, Runge, LaFlair, Yancey, Goodwin, Park and von Davier2022; Shin & Lee, Reference Shin and Lee2023; Xiao et al., Reference Xiao, Xu, Zhang, Wang and Xia2023) by expanding the evaluation criteria and parameters through a more comprehensive linguistic analysis and a direct comparison with benchmark texts. These criteria may vary in importance depending on the proficiency level, target language, or genre. For instance, passive constructions may be less relevant at lower proficiency levels.
Our findings emphasize the need for a hybrid evaluation approach that combines human review with computational analysis. For instance, the computational analysis has demonstrated that AI-generated texts often contain fewer proper nouns or numerals compared to benchmark texts. However, it is only through human analysis that these characteristics can be explained by the lack of specific examples and detailed information in the descriptions of research studies within the generated texts.
Limitations
This study represents a cross-sectional analysis, that is, it captures the state of development at a certain point in time. The findings might not be applicable to future versions of ChatGPT. The lack of transparency of commercial LLM solutions and the to-date inferior performance of open-source solutions (Gudibande et al., Reference Gudibande, Wallace, Snell, Geng, Liu, Abbeel and Song2023) make it infeasible to track changes in an AI system with certainty. Therefore, more research is needed to confirm the findings of this study. Also, a direct comparison between the two ChatGPT versions was out of scope for our study, although such an analysis could be conducted based on the statistical data provided in Appendices 2–6.
In our study, we opted for a fixed prompt and did not systematically explore the vast option space of prompt engineering or the refinement of the obtained results through multiple dialogue turns. Such strategies have been successfully applied in, for example, the domains of question answering (Liu et al., Reference Liu, Ping, Roy, Xu, Lee, Shoeybi and Catanzaro2024) and information extraction (Wei et al., Reference Wei, Cui, Cheng, Wang, Zhang, Huang and Han2023), and we are aware that they might have a large influence on text quality. Furthermore, we did not fine-tune the LLM using human-generated texts as training material, partly because such texts should not be made public or exposed to a language model.
Concluding remarks
This study offers empirical evidence that can assist test developers and developers of learning materials in refining AI-generated text content to match the quality of content created by human developers based on non-AI sources. By transparently highlighting the differences between AI-generated texts and benchmark texts (Burstein, Reference Burstein2023) within the framework of a German reading comprehension exam used for admission purposes, this research enhances our understanding of the potential applicability of LLMs in new contexts. Notably, we utilized a widely accessible LLM for text generation, making our findings particularly relevant for test developers and language teachers who may not have extensive resources available. The importance of conducting further replication studies cannot be overstated, as these would not only validate our results but also aid LLM users in comprehending ongoing changes in the technology. Additionally, extending this research to languages other than English could provide broader insights into the adaptability and effectiveness of LLMs across different linguistic landscapes.
Acknowledgments
We would like to thank colleagues from g.a.s.t. e.V. and anonymous reviewers for constructive feedback on the previous versions of the manuscript.
Appendix 1
Questionnaire for Evaluating Reading Texts for Authors
1. The text contains terms that require specialized knowledge.
2. The vocabulary used is broad.
3. The text includes complex linguistic structures (e.g., passive voice, subjunctive II, nominal style, subordinate clause structures, participial structures, infinitive constructions).
4. This text could appear in a popular science magazine.
5. The content can be understood based on general knowledge.
6. There are examples that illustrate the topic and the content.
7. The text is coherent.
Response options:
1 (strongly disagree) 2 3 4 (strongly agree)
Open-ended question: Are there any peculiarities in TEXT 1-20? Provide specific examples from the text.
Appendix 2 Statistics for readability measures
Table 2.1. Descriptive statistics for readability measures for the three types of texts

Table 2.2. Kruskal–Wallis test of statistical significance for readability measures (df = 2)

Table 2.3. Mann–Whitney U Tests of the significant results for readability measures

Appendix 3 Statistics for lexical measures
Table 3.1. Descriptive statistics for lexical measures for the three types of texts

Table 3.2. Kruskal–Wallis test of statistical significance for lexical measures (df = 2)

Table 3.3. Mann–Whitney U tests of the significant results for lexical measures

Table 3.4. Descriptive statistics for POS TTR for the three types of texts

Table 3.5. Kruskal–Wallis test of POS TTR for lexical measures (df = 2)

Table 3.6. Mann–Whitney U tests of the significant results for POS TTR

Appendix 4 Statistics for POS measures
Table 4.1. Descriptive statistics for POS measures for the three types of texts

Table 4.2. Kruskal–Wallis test of statistical significance for POS measures (df = 2)

Table 4.3. Mann–Whitney U tests of the significant results for POS measures

Appendix 5 Statistics for morphological measures
Table 5.1. Descriptive statistics for morphological measures for the three types of texts

Table 5.2. Kruskal–Wallis test of statistical significance for morphological measures (df = 2)

Table 5.3. Mann–Whitney U tests of the significant results for morphological measures

Appendix 6 Statistics for syntactic measures
Table 6.1. Descriptive statistics for syntactic measures for the three types of texts

Table 6.2. Kruskal–Wallis test of statistical significance for syntactic measures (df = 2)

Table 6.3. Mann–Whitney U tests of the significant results for syntactic measures
