I watched the young man, dressed in scuffy denim and T-shirt, dark curly head bent in concentration, juggling a book and a stack of papers on his lap as the subway lurched through the darkness. At first glance, I thought that he was working on his assignment, writing quickly and intently, in a race with the subway and its inevitable stops. As a load of passengers spewed into the car, he shifted slightly to accommodate the new crush, and it was then that I saw varied and youthful handwriting on his pages. He was grading those papers, not composing his own … (Miller, 1986, p. 26).
Much current exploration of GenAI-assisted language learning (yes, GALL) seems determinedly opposed to the idea of young men juggling books and papers on trains. Scholars want students to be free from the need to wait and the fear of embarrassment (Barrot, 2023; Hong, 2023; Godwin-Jones, 2022). The passage below echoes the optimism expressed by most scholars exploring GenAI’s affordances:
Best of all, the feedback is instant, unlike teacher feedback, which understandably takes time … by the time students receive their feedback, they might have completely forgotten everything from the previous writing. All in all, ChatGPT is a free and more efficient alternative to human tutors. (Hong, 2023, p. 40)
As a teacher myself, I too appreciate peaceful train rides unmolested by the need to grade papers. Any teacher you meet will tell you that he or she prefers a page from Murakami or Austen to a hastily scrawled 850-word composition on the pros and cons of smartphones. Every encounter with a subject-verb error, misplaced comma, and dangling modifier is another reminder that there’s more work to do.
Yet I believe teachers will continue grading and students will continue to wait, not because we don’t have faith in the powers of GenAI, but because we know the value of the Sisyphean slog that so much of teaching appears to be. While often persuasive and clearly well-intentioned, exploration of GenAI’s affordances has to date often left unexamined the rich and complex work that grading and other seemingly thankless tasks do for both teachers and learners. To develop strategies and principles for the integration of GenAI chatbots into instructional contexts, the current wave of GenAI research needs to engage in deeper dialogue with what second language acquisition (SLA) scholarship has learnt about vital language learning processes, including feedback, planning, and exposure to model texts.
1. Instant feedback whenever you want it!
Few would disagree with the notion that writers need feedback: it helps learners identify gaps in their repertoire and confirms the hypotheses that they make when they experiment with newly acquired language (Swain, 1995). With these principles in mind, many scholars have pushed for greater adoption of GenAI as a source of instant feedback.
Yet feedback can come in many forms, and not all of them enjoy a history of uncontested beneficence. Most scholars hail GenAI’s ability to correct grammatical errors as one of its chief affordances (e.g., Barrot, 2023; Hong, 2023; Su et al., 2023), but this may not necessarily benefit learners. Written corrective feedback (WCF) can be comprehensive or selective, depending on how many error types a teacher chooses to address, and these error types can be selected before or after grading. In general, evidence from the field suggests that focusing on a smaller set of error types is preferable, both for the cognitive load learners incur and for the impact on their motivation (Lee, 2020). In contrast, comprehensive WCF may overwhelm and discourage learners, especially beginning writers already lacking in confidence and skill. Besides the question of how much feedback to give, there has also been extensive debate over whether WCF should come in the form of explicit correction or hints without correction. Here, results have been less conclusive, and suggest that various factors, including task type, student proficiency, and error type, determine which might have the more positive impact on learners. Broadly speaking, hints work when students can correct errors independently, whereas explicit WCF works best for more complex errors (Lee, 2013). Since hints encourage greater cognitive engagement, providing explicit correction indiscriminately – which is what a chatbot is likely to do by default, unless users request hints instead – deprives more advanced learners of the chance to work things out themselves.
Yet none of this is to suggest that GenAI feedback cannot benefit learners. Unlike a teacher, a GenAI chatbot cannot readily determine how likely a particular learner is to correct an error himself or herself; this problem can be mitigated if a learner first asks for indirect WCF before asking the chatbot to provide explicit corrections, thus benefiting from both the higher cognitive engagement afforded by hints and the greater support offered by explicit correction. Such a proposal is an example of what can be developed when we evaluate and refine an affordance in the light of extant SLA research.
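To make the sequencing concrete, the sketch below shows how such a two-stage request might be scripted against a chatbot interface. It is a minimal illustration only, assuming the OpenAI Python SDK, an API key in the environment, and a placeholder model name, file name, and prompt wording; none of these details are drawn from the studies cited above.

```python
# Illustrative sketch of a two-stage WCF sequence: first indirect
# feedback (hints only), then explicit corrections. Assumes the
# OpenAI Python SDK and an API key in the environment; the model
# name, prompts, and file name are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI()

def get_feedback(essay: str, instruction: str) -> str:
    """Send the essay with a feedback instruction and return the chatbot's reply."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": essay},
        ],
    )
    return response.choices[0].message.content

INDIRECT = ("You are an English writing tutor. Point out each error in my essay "
            "and name its type, but do not supply the corrected form; give hints only.")
EXPLICIT = ("You are an English writing tutor. Now provide the corrected form of "
            "each error in my essay, with a brief explanation of each correction.")

essay = open("student_essay.txt").read()   # placeholder file
print(get_feedback(essay, INDIRECT))       # stage 1: hints support self-correction
print(get_feedback(essay, EXPLICIT))       # stage 2: explicit corrections as a safety net
```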
So the prospect of instant feedback may not be as simple as it first appears, at least when it comes to students. But what about teachers? Wouldn’t less grading free up time, allowing them to design better lessons and stronger curricula?
Grading, not composing: this is of course literally true – the young man Miller observed was not writing an essay of his own. Yet through annotation, recasting, and highlighting, teachers in fact construct a portrait of the learner richer and more vivid than whatever report GenAI can provide in the form of statistics and generalizations. We understand more deeply and retain longer whatever knowledge we’ve arrived at through deep processing; the information we acquire without analysis and effort, however, quickly evaporates (Marton & Säljö, 1976a, 1976b). We know that when the script is returned to them, our students do not slowly work their way through every tick, cross, or comment, but leap immediately to find out what grade they got. But the work we have done has not been for naught, for through our reading and rereading of each student’s script we come to a deeper awareness of where he or she is, and where he or she needs to go. Consciously or not, we pay special attention to aspects of performance that correspond to the lessons we have taught, and use what we see to evaluate and refine our lessons. In effect, grading is a core part of the reflective practitioner’s dialogue with the experiences he has designed:
… the designer may take account of the unintended changes he has made in the situation by forming new appreciations and understandings and by making new moves. He shapes the situation, in accordance with his initial appreciation of it, the situation “talks back,” and he responds to the situation’s back-talk. (Schön, 1983, p. 79)
Without the hard slog of grading behind the scenes, it is doubtful that a teacher’s classroom performance can remain as relevant and urgent as her students need it to be. For besides the portraits of individual learners, a kind of unconscious tabulation and cross-comparison is at work; the more patient practitioner may well keep a running record of common errors and clear strengths, but it may be more reasonable to expect most teachers beleaguered by fatigue and dogged by deadlines to leave such work to the less conscious, more implicit processes of the mind. Teachers emerge from the process clearer about what to do next and where to steer their students. Perhaps the class has shown that they are ready to move on to a new genre. More likely, there are clear errors that the majority has demonstrated, and longstanding offences that smaller groups have yet to eradicate from their performance. Such conclusions, lodged in the teacher’s mind through the quiet, constant hammer of analysis, comparison, and rereading, drive a teacher to make important decisions. And we can make these decisions with conviction because we have seen the evidence ourselves, experienced the confusion wrought by poor syntax, or chortled at the rich bathos of inappropriate register. We know that we might lose precious curriculum time because of this deviation, but we make the decision nonetheless because it is urgent. Our students need the detour.
Curriculum essentially means a race course. It shares the same root as corridor, current, course, words that evoke straight lines or circles. But the curricula drawn up before the start of the year are really tentative drafts; the actual course of study is inscribed through the decisions made by teachers from day to day. Instead of a smooth untroubled path, the circle of a curriculum might perhaps be more accurately envisioned as a line that is smooth only in places, to be interrupted hither and thither by a recursive series of loops whenever teachers need to go over old material and reinforce old messages. If we do end up offloading a substantial part of our students’ written output to GenAI for grading, what we gain in terms of time for lesson planning may ultimately come to nothing if, without close encounters with our students through grading, we lack the understanding needed to rethink our teaching.
2. Ideas organized the way you like them!
Besides instant feedback, scholars have also discussed the tantalizing prospect of learners offloading the planning process to GenAI. Planning in general involves goal setting, idea generation, and organization (R. T. Kellogg, 1996); if attention is a limited resource, investing too much of it in these processes might leave little for choosing the right word and constructing a more complex sentence (R. T. Kellogg, 1994). In fact, trade-offs occur not only between planning and language but within the latter as well: learners have to choose among extending the complexity of their language, maintaining accuracy, and focusing on fluency (Skehan, 1998). These trade-offs make the idea of GenAI as a writing assistant highly attractive. Barrot (2023, p. 57), for instance, argues that a GenAI tool like ChatGPT can not only suggest “essay topics based on the user’s area of interest” but also create or “transform any outline into a sentence, topic, alphanumeric, or decimal system format”, thereby helping learners develop their own outlines.
When idea generation is no longer a challenge, a learner should have sufficient cognitive resources left over to devote to the translating process; in addition, if the ideas proposed by the chatbot are sophisticated enough, they may push the learner to retrieve the linguistic resources necessary for expressing these ideas (Robinson, 2001). Given that learners involved in writing are already adopting a more syntactic mode of processing (as compared to learners who are processing information), this could help them expand their linguistic repertoire. If they do not find the linguistic resources necessary for doing so, they might notice the gap in their linguistic system; they could in turn look for additional input, either from more knowledgeable others such as peers and human tutors, or from the chatbot itself (Swain, 1995).
However, research from a sociolinguistic perspective suggests that this may not always occur. Instead of deploying the available resources for enhancing the complexity and accuracy of their output, learners may well focus on simply completing their task; ironically, this may be because they treat the task as a real-world experience and focus on communicating meaning (Ortega, 1999; Batstone, 2005). In any case, how learners wish to complete a task can never be completely controlled by the task designer (Breen, 1987).
Batstone (2005) suggests that a learning orientation is needed; Skehan (1996) sees adventurousness and risk-taking as necessary preconditions for the productive use of freed-up resources. In both cases, what a learner needs is a culture or environment that encourages mistakes and “failing forward.” It has been suggested that chatbots might create such preconditions by removing the anxiety learners experience in actual interaction with humans; yet freedom from embarrassment also means the absence of an authentic audience who might give meaning to the act of writing. No matter how warm or encouraging its tone might be, a chatbot is not another human being whose respect we cherish, whose laughter we would like a pun or quotation to elicit. Empirical investigation has consistently shown that the quality of students’ writing rises when they address authentic audiences other than teachers (Block & Strachan, 2019; Cohen & Riel, 1989; Wiggins, 2009), possibly due to greater motivation to process and store information about such audiences in long-term memory (Magnifico, 2010). How much motivation, then, can we expect to find in students when the intended audience is not even their human teacher?
Thus the impact of planning is not as direct as one might initially believe. Yet we can refine our use of GenAI to support planning by considering studies on task planning effects (Ellis, 2021; Johnson & Abdi Tabari, 2023). For instance, Ingley and Pack (2023) recommend asking a chatbot to play the role of an instructor who helps a writer brainstorm ideas via a dialogue; this echoes teacher-fronted or guided planning, which Foster and Skehan (1999) found to be effective in promoting gains across both complexity and accuracy. While their study focused on oral discourse, such findings suggest that guidance in the planning process can help learners overcome trade-off effects, the phenomenon where focusing on one dimension of language performance leads to concomitant dips in other dimensions due to limited attentional resources. Teachers would benefit from empirical investigation of whether a prompt like “You are an English language teacher – engage me in a dialogue to plan the ideas and language I can use for an essay on whether video games help or harm teenagers” could lead to the type of task performance Foster and Skehan (1999) described.
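As a purely illustrative sketch, the guided-planning prompt quoted above could be embedded in a simple dialogue loop. The example again assumes the OpenAI Python SDK and a placeholder model name; it is one possible way to stage such a dialogue, not a procedure drawn from the studies cited.

```python
# Illustrative sketch of a guided-planning dialogue using the
# teacher-role prompt quoted above. Assumes the OpenAI Python SDK
# and an API key in the environment; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

messages = [{
    "role": "system",
    "content": ("You are an English language teacher – engage me in a dialogue "
                "to plan the ideas and language I can use for an essay on "
                "whether video games help or harm teenagers."),
}]

print("Type 'done' to end the planning dialogue.")
while True:
    # Ask the chatbot for its next guided-planning turn.
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=messages,
    ).choices[0].message.content
    print(f"\nTeacher-bot: {reply}\n")
    messages.append({"role": "assistant", "content": reply})

    # Collect the learner's response and continue the dialogue.
    turn = input("Learner: ")
    if turn.strip().lower() == "done":
        break
    messages.append({"role": "user", "content": turn})
```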
3. Model texts at the drop of a hat!
Much of the enthusiasm that GenAI has inspired can be traced to its remarkable ability to produce what commentators often describe as uncannily humanlike texts. In the heady months after its launch, articles written by ChatGPT quickly went viral once they were published by trusted news providers. This capacity to immediately and almost flawlessly produce texts customized according to users’ specifications has given rise to claims that GenAI can provide model texts that serve as input for language learning (e.g., Kohnke et al., 2023).
Yet if GenAI output appears increasingly humanlike, is it because writing has become steadily more robotic and mechanical? This might well be truer in the field of formal education than in other sectors of modern life. For decades now, scholars have debated the value of the five-paragraph essay (Brannon et al., 2008) or the PEEL paragraph (Gibbons, 2019; McKnight, 2021); the genre approach, originally developed to enlighten second language learners about the organizational structures that first language learners implicitly possess (Cope & Kalantzis, 1993), has in many cases been oversimplified into a “recipe” approach where students are assessed on their ability to reproduce texts with a specific number of paragraphs for each part of an essay (Derewianka, 2003). This formulaic approach sounds exactly like what Barrot (2023) has observed: a GenAI composition “typically starts with a definition and a brief history of the concept”, “discusses the effects”, and concludes with a “summary of main points, final thoughts and call to action” (p. 4).
As a teacher, I have often found it difficult to find authentic texts that match the features and organization of a particular genre. The regularities and patterns that analysts report (e.g., Derewianka & Jones, 2016) have been distilled from a great number of texts; one would be hard pressed, however, to find any one text that exemplifies all the defining traits of the genre it belongs to. A teacher who curates texts from the wild is obliged to explain to students that these texts represent what writers actually do, and that the genre conventions they learn are only tendencies to bear in mind; consequently, a student comes to develop a greater appreciation for the freedom that writers enjoy. In contrast, learners who rely primarily on a steady diet of texts custom-made to fit the specifications of a certain genre will come to believe that effective writing is all clockwork convention and rigorous regularity. And like the serpent swallowing its own tail, students fed a regular diet of GenAI output may well become teachers who see GenAI writing as the only model texts their students need.
Being a product of such a system, I know how difficult it is to resist the instinct to fall back on PEEL and the structure of the five-paragraph essay when attempting an unfamiliar topic. Yet, as Bereiter and Scardamalia (1987) have been telling us since 1987, writing is more than simply retelling what we know; it is a transformative process in which what we think we know changes through the need to express it. Writers are their own first readers: a gap in our rhetoric – some word that doesn’t feel quite right, a sentence that cannot quite capture what we hoped to express, or paragraphs that just don’t link up very well – sometimes pushes us to reevaluate the thought that eluded our language; going back to reconsider this thought, we might rework the string of propositions that comprise it and emerge with fresh insight, which in turn sets in motion another search for the right word, the best syntax, a better way to organize our text. And so it is through these recursive cycles of self-reading and rewriting that the text takes shape, a text organized according to the stream of ideas it hopes to address, simultaneously familiar enough to fit within the rules of the genre it belongs to and different enough to suit the writer’s purpose and personality. A writer knows when he needs to be pragmatic and efficient, and when he needs to work things out more patiently and reflectively; which path he chooses determines to a large extent how closely he clings to the rules of a particular genre.
Good writing is ludic, a chess match where both reader and writer win if they work together to bend the rules. The rules are the features and conventions of each genre; a writer can choose to operate safely within the space these expectations demarcate or opt for the road less taken. In her essay on AI-generated writing, Morrison (2023, p. 158) describes how her attempts at subverting reader expectations make her writing seem “almost willfully insubordinate, self-sabotaging.” GenAI output, in contrast, is “a marvel of correct, mild-mannered, balanced, objective prose,” properly organized and studiously formatted. Would it be quixotic to believe that the ability to craft an imperfect text, the kind where one might find a distinctly recognizable voice inseparable from certain inflections of tone and quirks of syntax, might represent a certain advantage in a world where everyone’s writing appears increasingly similar?
Could GenAI be a collaborator in the process of developing a personal voice? As we have seen with feedback and planning, making it one requires a strategic approach that draws on existing scholarship. In their case study of a young academic adept at integrating ChatGPT into every phase of her writing process, Jacob et al. (2024) describe how “Kailing” rejects ChatGPT output that does not match her personal style even when it appears eloquent, and how she begins to rely less and less on the chatbot when she detects recurrent patterns in its lexicogrammatical choices. In a word, it is possible to outgrow GenAI. Thus the prototypicality that characterizes GenAI output can be harnessed in two ways, depending on where a learner is as a writer: at the early stage, when genre knowledge is lacking, its output demonstrates the language features that should be emulated; at later stages, when learners are ready to advertise their identity as unique voices, GenAI output represents the kind of language one should seek to avoid, modify, or subvert.
4. The intelligent use of intelligent things
As I try to stay focused on completing this essay, my son is waiting for me to pause so that he can show me his latest effort at writing, a diary entry in Mandarin chronicling this morning’s exciting events: going to school with his sister, having toast for breakfast, taking the bus. He wrote it sitting next to me, stopping every now and then to ask me how to write a particular word or phrase or whether he had chosen the right punctuation mark. He tries to hide the words from me, insisting that I must not read it before he is done. As I (secretly, slyly) read his composition, certain turns of phrase catch my eye. I notice the chunks he has borrowed and acquired from the texts I’ve been reading to him, the personal favorites that I’ve (secretly, slyly) hoped he will also come to love.
I wonder how this relationship might change if I told him to send his texts to a chatbot instead for instant feedback. I would prefer him to come up with topics that really matter to him, that come from his own observations about the world. Because I care about what he reads, I would rather he learn good writing from texts I myself have enjoyed and seek to emulate. I believe many teachers feel the same way about their students.
More than two decades ago, Warschauer and Healey (1998, p. 67) urged teachers to explore “the intelligent use of CALL” while awaiting the arrival of “intelligent CALL.” While the explosion of literature on GenAI suggests that the wait is over, there are still plenty of questions for intelligent teachers to discuss. If teachers grade less than they currently do, how might the quality of teaching be affected, and how much would learners benefit from GenAI feedback? If learners outsource planning to a chatbot, would the attention thus freed up necessarily be channelled towards language? Would model texts that flawlessly reproduce the features of a certain genre inspire good writing?
Such questions are critical but also difficult to answer alone. Teachers have the experience of classroom work to turn to, and their relationships with students to draw on. But we would fare much better if the research community supported us with carefully constructed studies that seek to more fully unravel the impact of GenAI on the experience of language learning.
Competing Interests
The author declares no competing interests.
Joo Jin Sim is a teacher in Singapore. As a Lead Curriculum Resource Development Specialist in the Ministry of Education, he guides the translation of research into multimedia resources, digital tools, and curricular materials that support the teaching of English Language and Literature in English. His research interests lie in the field of second language acquisition, particularly the psychology of language learning and instructional approaches like task-based language teaching and genre-based pedagogy. Joo Jin is currently pursuing a Ph.D. at the National Institute of Education, Nanyang Technological University. For his doctoral study, he is developing a novel task sequencing model based on the Limited Attentional Capacity approach and examining its impact on the written task performance of Singaporean adolescent learners.