1. Introduction
Over the past several decades, a growing body of critique has called for a more critical lens that acknowledges the power dimensions of language testing (Lynch, Reference Lynch2001; McNamara & Knoch, Reference McNamara and Knoch2019; McNamara & Roever, Reference McNamara and Roever2006; McNamara & Ryan, Reference McNamara and Ryan2011; Roever & Wigglesworth, Reference Roever and Wigglesworth2019; Schissel & Khan, Reference Schissel and Khan2021; Shohamy, Reference Shohamy1998, Reference Shohamy2001). Scholars cite the responsibility to routinely establish fairness, to center justice, and to uncover injustice stemming from the consequences of assessments (Randall, Reference Randall2021). This centering requires acknowledging the socio-political contexts in which assessment exists (Schissel & Khan, Reference Schissel and Khan2021). For instance, language assessments are used in making decisions about employment, citizenship, education access, and salary, all of which decidedly impact people's well-being and have material consequences for society. This creates an ethical imperative for our field.
Addressing the issue of equity in assessment research is critical, as research informs the applied work of test development and use. Inadequate discussion of equity can lead to harmful effects, including limited access and opportunities for language learners. Specializations within the field, such as writing assessment, require critical interrogation as well (Inoue, Reference Inoue2015). With the premise that research should inform our practice, the focus of this State-of-the-Art review is to survey research in writing assessment and analyze it with a framework of fairness, justice, and criticality. The concerns of fairness and justice can be viewed through criticality, a guiding critical lens, to study and illuminate equity/inequity. Our review narrows its focus to research that centers writing assessment in teaching and learning in its design, analysis, findings, or implications (i.e. not reviewing writing assessment in other contexts such as employment or citizenship). This approach views assessment as part of pedagogy and a tool in developing writing.
In this article, we present themes identified in reviewing a sample of writing assessment research in teaching and learning over the past 25 years and argue for layering a framework of fairness, justice, and criticality over future work in this area to consider how equity has been researched and could be further interrogated. The review speaks to a need to update past reviews of writing assessment research (Hamp-Lyons, Reference Hamp-Lyons2002; Yancey, Reference Yancey1999). Reviewing the knowledge gained from these existing themes in scholarship serves as a starting point for considering where further work is needed to connect these themes to issues of equity. We propose that writing assessment research confront issues of fairness and justice more intentionally.
2. A framework for fairness, justice, and criticality in language assessment
In this review, we consider a framework that comprises fairness, justice, and criticality as part of a socially conscious approach to writing assessment research. We propose that innovation in writing assessment – through new constructs, methods, and perspectives – will support the field in addressing concerns of fairness and justice. Figure 1 outlines the critical framework applied in this State-of-the-Art review. Assessment context is the foundation on which the interlaced concepts of justice and fairness reside. While validity is not the primary focus of our critique, the framework includes it as a dominant quality feature in assessment and a core element that intersects with fairness and justice. Validity includes consequences of an assessment (Messick, Reference Messick and Linn1989) and is essential in work to examine assumptions or dominant narratives (Randall, Reference Randall2021). We propose a framework that asks critical questions (a critical lens) about fairness and justice as they intersect with validity. We will briefly define each of these concepts in relation to writing assessment.

Figure 1. Framework for critical review.
Fairness can be defined as the more procedural considerations in language assessment that impact a test's quality (i.e. construct relevance) and the experience of test takers (McNamara & Ryan, Reference McNamara and Ryan2011). Fairness is most often discussed in connection with high-stakes standardized assessments, but it is equally important in assessments used in teaching and learning. The linkage between validity and fairness is not new (Messick, Reference Messick and Linn1989), and the understanding of fairness in language testing has evolved greatly, becoming more nuanced (Kunnan, Reference Kunnan2014; McNamara & Knoch, Reference McNamara and Knoch2019; Poe & Elliot, Reference Poe and Elliot2019). Of the three terms (fairness, justice, criticality), fairness has received the most attention in language assessment, but our understanding of it remains decidedly incomplete. While some have held a broader view of fairness, the definition we are referencing situates fairness closely in the design and delivery of an assessment (McNamara & Ryan, Reference McNamara and Ryan2011).
Poe and Elliot (Reference Poe and Elliot2019) reviewed 73 articles on assessing writing to investigate how they included fairness. The researchers identified trends from these articles, namely discussions about minimizing bias, establishing validity, recognizing social contexts, providing legal definitions, and determining ethical responsibility. Their analysis of the representation of these trends over time revealed differing degrees of attention. The technical aspects of fairness – bias and validity – have been increasingly brought up in publications, while social aspects, context, and ethics are mentioned less often. Despite this respectable number of articles, Poe and Elliot felt the scholarship has not ‘resulted in shared taxonomies across disciplinary orientations, led to a deepening of theoretical conceptualizations of fairness or brought about innovative classroom approaches’ (p. 15). The field needs to remedy this gap and address social and ethical aspects of fairness more deeply.
Justice can be defined as the socio-political and consequential implications of testing. In some definitions, justice is combined under fairness in relation to consequences (Kunnan, Reference Kunnan2014); however, we consider justice to be overlapping with but separate from fairness (McNamara & Ryan, Reference McNamara and Ryan2011). Justice requires us to examine inequities in the use of assessments and their impact on individuals, communities, and society. Randall (Reference Randall2021) speaks to the need to apply a justice-oriented framework of anti-racism in rethinking constructs in educational measurement. In doing so, we interrogate the purpose of an assessment in terms of justice as well as reflect on our positionality as assessment users/developers/researchers/teachers. A justice-oriented approach seeks the input and understanding of stakeholders in the assessment process.
Validity has been given considerable attention in writing assessment over the past 25 years or more (White, Reference White2019) and can be described as the quality of a test's content coverage and relevance to the construct being assessed. According to Kane, 'Validity is not the property of a test. Rather, it is a property of the proposed interpretations and uses of test scores' (Kane, Reference Kane2013, p. 3). In education, this perspective considers validity in assessments as supporting the assessment-learning cycle rather than solely judging learning or learners (Shepard, Reference Shepard2016). It also subsumes the consequences of tests, which directs writing assessment research to engage with fairness and justice.
Relationships between these key concerns in writing assessment have been discussed in the field. While different views have appeared, as stated previously, we follow McNamara and Ryan (Reference McNamara and Ryan2011), who draw upon Messick (Reference Messick and Linn1989) and point to his 'Facets of Validity' as key to fairness, citing construct validity and test use (relevant utility), while value implications and social consequences align with justice. McNamara and Ryan (Reference McNamara and Ryan2011) conceive of justice as an umbrella encompassing fairness, values, and consequences. That said, the relationship with fairness, and arguably with validity, is reciprocal: 'Concerns for fairness [...] have the paradoxical potential to cloud issues of justice' (McNamara & Ryan, Reference McNamara and Ryan2011, p. 175). In assessment, validity has been the linchpin of research and practice; however, fairness has recently risen as an equal partner to validity. Fairness is a necessary criterion for validity. For example, when assessments advantage certain groups over others, this constitutes construct irrelevance – a concern of validity discussed by Randall (Reference Randall2021) – but also a threat to fairness.
Justice has received limited attention to the point that Schissel and Khan (Reference Schissel and Khan2021) have challenged the field:
The current state of (limited) engagement with social issues does not make the issues disappear nor diminish their significance, but rather simply reflects an ignorance, and perhaps even cowardice, and works to limit, stifle, and police a field at precisely the time when it could be developing in other ways. It is time to work actively against exclusionary disciplinary approaches (p. 2).
To interrogate issues of power, assessment context, as the socio-political milieu, becomes a necessary consideration. Social/socio-political factors might also include dominance of the language of an assessment; cultural knowledge or practices embedded in an assessment that are unfamiliar to test takers; test purposes such as gatekeeping; or assessment tied to advancement and opportunity. For example, assessments used in citizenship or employment requirements will garner somewhat different critical questions than assessments in educational settings. Context also determines the stakeholders impacted by an assessment and the decisions made therein. In this review, our focus is on writing assessments used in education, narrowing the critique to research related to teaching and learning writing. While the many domains in which writing assessments are used deserve attention, our experience has been primarily as language educators, and thus this is the focus we feel best qualified to critique. Further, this focus is most relevant to Language Teaching.
Criticality entreats us to ask questions about fairness and justice and about context and power. It provides a process, through a critical lens, to challenge us in enacting a justice-oriented approach. Criticality leads scholars to question established and invisible practices and systems that perpetuate inequity in our society, communities, and classrooms (De Costa, Reference De Costa2018; Kubota & Miller, Reference Kubota and Miller2017; Soto, Reference Soto2022). It sheds light on power, dominance, and oppression. Criticality leads researchers to ask questions such as, 'Whose standards are these?', 'Who is excluded?', and 'Does this uphold hegemony and marginalization?' To do this work, researchers need to consider their own positionality and reflect on their experiences, underlying assumptions, and understanding of racism, oppression, inequity, and the dominant norms that shape their work and beliefs (Brown, Reference Brown2013; Sealey-Ruiz, Reference Sealey-Ruiz2021). Criticality is not an answer but a process that changes our thinking and our vision. While taking a critical view can create dissonance and discomfort, such 'crises' (Kumashiro, Reference Kumashiro2000) are how we learn, unlearn, and do better. Therefore, it should have a central role in research. We need to use a critical lens in research to understand and to take responsibility for inequity in writing assessment in teaching and learning.
In our review, we are proposing fairness and justice be informed and transformed by criticality in writing assessment. In particular, we advocate for more work to consider the social/socio-political context of writing assessment in teaching and learning. Adopting a perspective of criticality interrogates assumptions and contexts for power and equity (or inequity). This call to language testing is not new. The need for this work resides in the ubiquity of assessments in decision-making and the impact of those decisions. Critical language testing has had a voice in the field since the 1990s, through work by Shohamy (Reference Shohamy1998, Reference Shohamy2001), Lynch (Reference Lynch2001), McNamara & Roever (Reference McNamara and Roever2006), and others. In the words of Shohamy, ‘Critical language testing assumes that the act of testing is not neutral. Rather, it is both a product and an agent of cultural, social, political, educational, and ideological agendas that shape the lives of individual participants, teachers, and learners’ (Reference Shohamy1998, p. 332).
3. Writing assessment research in teaching and learning
To review writing assessment critically, context is needed. Asking critical questions requires understanding the purpose and uses of assessment and who is impacted. Thus, for this review, we have narrowed the focus to writing assessment research related to education, specifically teaching and learning. While this still casts a wide net, centering our review on assessment as part of pedagogy means less attention to assessments used in citizenship requirements, admissions processes, or employment/salary decisions. At one time, language testing research and theory-building focused largely on large-scale assessments, due to the high-stakes decisions made using these tests, such as employment, education access, and citizenship (Inbar-Lourie, Reference Inbar-Lourie2008). However, assessment's role in everyday learning in classrooms is equally important in facilitating access and opportunities to learn. Our focus is on inquiry into assessment for instructional purposes, that is, practices enacted by teachers in educational contexts to support learning. We reviewed research on classroom-based assessment and also drew on studies of larger-scale assessment whose findings and implications are useful to the teaching and learning of writing.
Discussions of fairness, consequences, and ethics have involved large-scale standardized testing due to the high stakes impact and policy dimensions of these assessments. However, assessments in classrooms should also be considered critically as they can affect the access, trajectory, and opportunities of learners. Potentially, since these assessments are subject to less scrutiny and generally do not adhere to established quality checks around issues of equity, they may cause greater harm. Asking critical questions in research should further the expectations for fairness and justice for learners in these contexts.
Scholars in the field of writing assessment have been increasingly interested in what kinds of assessments promote learning, which methods are effective and efficient in classroom assessment, and how to productively prepare teachers to use assessments (Crusan et al., Reference Crusan, Plakans and Gebril2016). Most recently, the conceptualization of the student at the center of assessment in learning has been detailed in work on learner-oriented assessment (Gebril & Brown, Reference Gebril and Brown2019; Jones & Saville, Reference Jones and Saville2016; Turner & Purpura, Reference Turner, Purpura, Tsagari and Baneerjee2016). In a review of 25 years of research in writing assessment, White (Reference White2019) described the shifts in thinking about writing assessment and classroom applications: 'While different community positions remain distinct – writing teachers are unlikely to love multiple-choice tests and test professionals will remain distrustful of formative teacher evaluations – the hardened positions of yesterday have become more flexible' (p. 42).
In this State-of-the-Art article, we present a review of published research to understand themes in assessing the skill of writing. We explore its interface with teaching and learning and put forward areas for more work by applying a framework for critique related to fairness, justice, and criticality (see Fig. 1). This review leads to the discussion of potential future innovation and challenges for writing assessment in language education.
4. Methodology
Our approach to this review was multifaceted, aiming to provide both breadth and depth. The steps in our synthesis of the research are briefly listed below:
• Collect and vet research articles across specialized high-impact journals in L2 writing, assessment, and language education
• Conduct initial review of articles for patterns and themes
• Evaluate connections to teaching and learning to narrow the pool
• Explore emergent topics within these areas of writing assessment
• Consider implications from this review for pedagogy
• Apply framework (Fig. 1) for critical review of themes for future directions.
Our search began with a range of journal issues published between 2009 and 2025. We entered the keywords ‘writing’ or ‘writing assessment’ in multiple databases to assemble a unified body of relevant studies. This initial search yielded far more articles than could feasibly be reviewed in a single article; thus, we decided to narrow the journal inclusion base. The review focused on major flagship journals in research on language assessment, writing, and second language learning. The journals range in impact factor (JIF) from 1.4 to 5.0 and are listed in the Social Sciences Citation Index (SSCI, Web of Science). These criteria were used to assure that the research reviewed has undergone a competitive peer review process. The journals from which studies were sourced include Language Teaching, TESOL Quarterly, Journal of Second Language Writing, Assessing Writing, Language Assessment Quarterly, Language Testing, Journal of English for Academic Purposes, System, and Journal of English for Specific Purposes. However, we recognize that there are many other journals, books, and book chapters that include important and diverse research perspectives. For this study, we kept the scope narrow to allow for depth but feel that further study casting a wider net would be fruitful. These journals were searched for articles using the keywords ‘writing + assessment’, which led to an initial sample of over 800 articles. Table 1 presents the count of articles reviewed in each of these journals.
Table 1. Articles reviewed on writing assessment in teaching and learning research

With these targeted journals in the areas of language teaching, language learning, writing, and assessment, we conducted a comparative process to review for emergent themes by grouping similar journals. This comparative approach allowed us to consider broad themes in several areas. In the first round, we searched in Language Teaching, System, and TESOL Quarterly, journals known to publish on a wide range of topics related to language teaching and learning. For the next stage of the search, we considered journals that published articles focusing primarily on L2 writing or L2 writing assessment: Journal of Second Language Writing, Journal of English for Academic Purposes, Journal of English for Specific Purposes, and Assessing Writing. Lastly, we shifted our attention to journals that publish articles exclusively in language assessment: Language Assessment Quarterly and Language Testing. Following this review, we sampled two journals in educational assessment and measurement, Educational Assessment and Assessment in Education, but neither published any articles that focused on second language writing.
Through this step-by-step process, we reviewed research studies to generate an initial list of themes in writing assessment research. These themes were rather rough, with both overlap and breadth. The following was the list of themes: (1) writing assessment in general, (2) rating and scoring, (3) features of writing performance, (4) process and strategies, (5) integrated assessment, (6) assessment, teaching, and learning, (7) teachers’ perspectives, and (8) learners’ perspectives. To make the review and critique more meaningful, we narrowed the pool of research articles and themes further to address context. Thus, we scored all the articles in terms of relatedness to teaching and learning.
To ensure we were selecting the most relevant research for the context of interest, we developed a process to rate the studies by their relevance to teaching and learning. Each reviewer independently read a set of 20 articles, using a simple three-point scale to score relevance. After this initial rating, we compared scores to discuss the differences between levels (i.e. what was a 1 versus a 2 or 3) as concretely as possible, which led to further refinement. We also assigned an anchor example for each score level for our reference and calibration. Then, each reviewer gave each article a score on the scale (shown below) to reflect how closely the content connected to teaching and learning in its research design, analysis, or implications (see Footnote 1).
Each article will be given a score on a three-point scale to reflect how closely it connects L2 teaching and learning with assessment, either in its research design, analysis, or implications of the findings.
(1) not related or only mentioned in implications or rationale
Example: Ling (Reference Ling2017). Are TOEFL iBT® writing test scores related to keyboard type? A survey of keyboard-related practices at testing centers. Assessing Writing, 31, 1–12.
(2) somewhat related, appearing in at least one area (design, analysis, or discussion)
Example: Huang (Reference Huang2012). Using generalizability theory to examine the accuracy and validity of large-scale ESL writing assessment. Assessing Writing, 17(3), 123–139.
(3) assessment is highly connected with teaching and learning throughout the article’s framing and in the research design and analysis
Example: Vincelette and Bostic (Reference Vincelette and Bostic2013). Show and tell: Student and instructor perceptions of screencast assessment. Assessing Writing, 18(4), 257–277.
From this rating, we selected research articles that were strongly connected (scored 3) or somewhat connected (scored 2) to teaching and learning, and we excluded articles less connected to these areas of interest (scored 1). In the reference section, articles included in the final review are marked with asterisks (*).
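The rating procedure above reports calibration steps but no agreement statistic. As a minimal sketch only (in Python, with hypothetical scores rather than the authors' actual ratings, and a retention rule assumed for illustration), two reviewers' three-point relevance ratings could be compared before retaining articles scored 2 or 3:

# Hypothetical relevance ratings (1-3) from two reviewers for the same ten articles;
# these numbers are illustrative, not the ratings reported in the article.
reviewer_a = [3, 2, 1, 3, 2, 2, 1, 3, 2, 3]
reviewer_b = [3, 2, 2, 3, 2, 1, 1, 3, 3, 3]

# Exact agreement: proportion of articles given identical scores.
exact = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / len(reviewer_a)

# Adjacent agreement: scores differing by at most one point.
adjacent = sum(abs(a - b) <= 1 for a, b in zip(reviewer_a, reviewer_b)) / len(reviewer_a)

# One possible retention rule (assumed here): keep articles scored 2 or 3 by both reviewers.
retained = [i for i, (a, b) in enumerate(zip(reviewer_a, reviewer_b)) if a >= 2 and b >= 2]

print(f"exact agreement = {exact:.2f}, adjacent agreement = {adjacent:.2f}")
print(f"articles retained (indices): {retained}")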
We read and analyzed this narrowed pool of 69 articles, identifying emergent themes in the studies and reviewing their findings with a critical lens. Coding strategies from qualitative research (Saldaña, Reference Saldaña2011) were used to create initial and axial codes. We conducted the initial coding individually before discussing codes, patterns, and potential themes together. Some themes stayed the same as in the initial round of comparative coding; others were combined into new themes. Frequently, an article fell across themes, and we discussed how to distinguish themes and articulate the core concepts of articles.
In our review of research since 2009, five major overarching themes emerged in published writing assessment research: (1) features of writing performance; (2) rating and scoring; (3) integrated writing assessment; (4) teachers’ and learners’ knowledge, beliefs, and perspectives; and (5) corrective feedback. These themes represent the central areas of research attention during this time period. Sub-themes within each major theme also emerged and will be presented in the next section.
These five areas have generated a substantive amount of attention in answering empirical and interpretive questions, resulting in foundational knowledge. A need for more attention to fairness and justice became apparent, however, requiring critical approaches as a direction to elevate this work. A final stage of analysis led us to read the research in each theme with a critical lens to consider where authors addressed fairness, justice, or criticality, as drawn from our framework (Fig. 1). This critical review was a second layer of analysis, considering the major themes through our framework and their implications for future research.
Validity and context are important parts of the framework (Fig. 1); however, they do not emerge as central findings in the critique, for different reasons. Writing assessment research has a robust tradition of research questions related to validity, and thus it appears throughout the themes described in the review. However, interest has not focused on how validation has grappled, or could grapple, with the challenges of fairness and justice in writing assessment using a critical lens. Context, in contrast to validity, has not been a central feature of writing assessment research and thus does not emerge as strongly in the identified themes, yet it impacts fairness and justice in important ways. We recognize a degree of interpretation in this process and acknowledge that this organizational scheme is filtered through our own perspectives on writing assessment and research.
5. Themes in writing assessment in teaching and learning
In this section, we synthesize research by theme to highlight recent work in writing assessment related to teaching and learning. Using our framework for critical review, each theme is followed by a discussion of (1) whether or how the research addresses fairness and justice in assessment and (2) where there is space or potential for a critical lens in this work.
5.1. Features of writing performance
Researchers in L2 writing assessment have made major efforts to understand how specific textual characteristics in L2 writing relate to test scores and second language proficiency. Some of the most investigated language features include complexity, accuracy, lexis, and fluency (CALF) (Adams et al., Reference Adams, Nik Mohd Alwi and Newton2015; Plakans et al., Reference Plakans, Gebril and Bilki2019; Wolfe-Quintero et al., Reference Wolfe-Quintero, Inagaki and Kim1998). However, L2 writing assessment scholarship has also expanded to incorporate other measures to investigate the quality of writing, rather than relying exclusively on CALF. For example, Taguchi et al. (Reference Taguchi, Crawford and Wetzel2013) examined college-level ESL learners’ argumentative essays, focusing in particular on language use and content. Language use was operationalized as ‘facility in the use of effective, complex constructions, and few or no grammatical errors’ (p. 423), and content referred narrowly to the accurate representation and effective use of the source text. While both were found to be indicative of higher essay scores, the content measure contributed more than language use to total score variance. This finding suggests that content may not be merely a surface-level construct; instead, it is possible that content serves as a higher-order construct that shapes how language is used. More recently, Sato (Reference Sato2024) clarified the construct of content, a crucial yet often absent element, in the context of content and language integrated learning, highlighting the inextricable nature of language and content.
Another framework for writing performance features was proposed by Kuiken and Vedder (Reference Kuiken and Vedder2017), who highlighted the importance of assessing functional adequacy in L2 writing. Informed by Grice’s (Reference Grice, Cole and Morgan1975) maxims of quantity, relevance, manner, and quality, the researchers developed a rating scale scoring the dimensions of content, task requirements, comprehensibility, and coherence and cohesion. Based on essays written by L2 Dutch and Italian learners, the researchers subjected the scale to correlation and reliability checks. Intra-class correlations across the four dimensions ranged from acceptable to excellent. Furthermore, raters scored the writing of the same participants consistently, with reliability estimates between .455 (task requirements for Italian) and .877 (comprehensibility for Dutch). This study contributed to an understanding of L2 writing as a multidimensional construct that requires the orchestration of not only linguistic but also pragmatic resources to communicate and make meaning.
Recent research applying CALF measures examines how such measures vary as a function of differing writing conditions. For example, seeking to examine the relationship between task complexity and CALF measures, Frear and Bitchener (Reference Frear and Bitchener2015) found that an increase in reasoning demands and in the number of task complexity elements elicited greater lexical and syntactic complexity. Adams et al. (Reference Adams, Nik Mohd Alwi and Newton2015) analyzed the performance of 96 students on engineering simulation tasks conducted through text chat. They manipulated two task conditions: the amount of direction provided to complete a task and the language support required. The findings suggested that more complex tasks push learners to produce more accurate language. A study by Ong and Zhang (Reference Ong and Zhang2010) examined the effects of providing planning time, ideas and macro-structure, and draft availability. Learners given planning time, ideas, and a macro-structure were found to produce more complex language with more words (i.e. fluency). In the context of multi-skill assessment involving writing based on reading and listening passages, Plakans et al. (Reference Plakans, Gebril and Bilki2019) found that fluency, operationalized as total word count, was the strongest predictor of integrated writing proficiency. Morphological accuracy contributed more to score variance than either syntactic or grammatical accuracy. Although significant, the contribution of complexity, operationalized as mean length of T-unit, to score variance was not as strong as that of fluency and accuracy. To summarize, CALF measures are informative, but, as research has shown, they do not capture a fixed property of learners’/test takers’ language; they vary for a host of reasons, such as the task conditions reviewed here.
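Some of the operationalizations named above (fluency as total word count; complexity as mean length of T-unit) are simple enough to illustrate concretely. The sketch below (Python) computes rough proxies only: it uses mean sentence length as a stand-in for mean T-unit length, since true T-unit segmentation requires syntactic parsing, and it is not the analysis pipeline of any study cited here.

import re

def calf_proxies(text: str) -> dict:
    """Rough proxies for some CALF measures discussed above.

    - fluency: total word count (as in the word-count operationalization above)
    - lexical diversity: type-token ratio (a crude lexical measure)
    - complexity: mean words per sentence, a coarse stand-in for mean length
      of T-unit, which would normally require syntactic parsing.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "fluency_word_count": len(words),
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }

# Example: a short, invented learner response.
sample = ("The sources disagree about renewable energy. The reading claims costs "
          "are falling, but the lecture argues that storage remains expensive.")
print(calf_proxies(sample))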
While research on the measurement of writing features in L2 writing performances over the past decade has mostly been carried out using the CALF framework, a growing number of research studies draw on other features that speak to capabilities in higher-order thinking, including authorial voice, critical thinking, and source use (Behizadeh & Engelhard, Reference Behizadeh and Engelhard2014; C. G. Zhao, Reference Zhao2013; Kim & Crossley, Reference Kim and Crossley2018). L2 writing appears to be a multidimensional language skill that subsumes and interacts with various types of capabilities beyond lexicogrammatical features.
Over the past decade, research on writing features in assessment has not specifically centered on classrooms; however, the composite of findings holds implications for teaching and learning. The research suggests that instruction should address the multidimensional nature of second language writing. Writing assessment used in teaching should provide students and instructors with insights not just into the accuracy, fluency, or complexity of students’ writing, but also into how students engage with content, how they meet the demands of the task, and how features such as coherence and cohesion increase the comprehensibility of their writing. Another implication concerns how the design of assessment tasks shapes the features of students’ writing performance. For example, careful consideration is needed of the complexity of tasks as well as of what supports are provided, including planning time and source material.
5.1.1. Fairness, justice, and criticality in writing features research
The current approach to researching characteristics of writing at different proficiency levels needs a broad critical lens to uncover embedded norms and values. From a validity standpoint, we need to interrogate the constructs and standards that have highlighted these features for attention. Shohamy stated in ‘Critical language testing and beyond’ that we need to specify ‘whose knowledge the tests are based’ on (Reference Shohamy1998, p. 333). A critical approach requires recognition of, and transparency about, where characteristics such as CALF come from, as well as how quality is interpreted on the basis of specific measures. What defines proficiency and proficiency levels in English or any language in which L2 writing is assessed? On what variety of English are measures of accuracy based? Why is fluency a predictor when conciseness is also a value of communication?
Questions related to justice need to be asked about power, equity, and fairness in relation to how we characterize writing performances in assessments. These are big questions, each with an individual research agenda. A multilingual rather than monolingual perspective troubles the CALF classification and could lead to more equitable measures. It is promising that scholars are beginning to look at sociocultural features of writing, adding depth to the traditional features that have been the focus of L2 writing (Behizadeh & Engelhard, Reference Behizadeh and Engelhard2014; C. G. Zhao, Reference Zhao2013). These new areas for research are shifting focus from accuracy and length to communicative impact and content (Kim & Crossley, Reference Kim and Crossley2018; Kuiken & Vedder, Reference Kuiken and Vedder2017). These alternative targets allow, or could allow, a more just consideration of language varieties in writing assessment. The field perpetuates systems that subjugate multilingual writers if we ignore the issue of dominant norms in ‘features’ or ‘proficiency levels’ in our research.
5.2. Rating and scoring L2 writing
As L2 writing assessment has shifted from indirect measurement of writing knowledge to direct measurement of writing performance (Crusan, Reference Crusan2010), challenges of rating have become a source of concern for researchers and practitioners. This work focuses on the rating and scoring of performances in writing assessments. In the context of L2 writing assessment, research has shown that variables such as topic, discourse mode, genre, and time limits can pose a threat to reliable rating, thus interfering with the accurate reflection of a student’s proficiency and undermining the meaningful use of scores (Schoonen, Reference Schoonen2005).
The rating scale, a critical tool that informs ‘decisions and inferences about writers’ (Weigle, Reference Weigle2002, p. 108), plays a central role in L2 writing assessment. Its systematic and rigorous nature aligns with the broader goals of educational research, which seeks to generate evidence-based findings grounded in transparent evaluative practices. The research reviewed has addressed rating scales primarily with regard to their development, adaptation, and validation. The recognition that writing is a multifaceted construct encompassing not only linguistic but also sociolinguistic and pragmatic aspects of language has led researchers to develop rating scales incorporating complex, nuanced constructs. For example, attempts have been made to develop analytic rubrics measuring authorial voice in L2 argumentative writing (C. G. Zhao, Reference Zhao2013), authenticity (Behizadeh & Engelhard, Reference Behizadeh and Engelhard2014), and critical thinking (Saxton et al., Reference Saxton, Belanger and Becker2012). Some studies have introduced scoring rubrics intended for specific modes of writing, such as a rubric for electronic portfolios of L2 writing (Lallmamode et al., Reference Lallmamode, Mat Daud and Abu Kassim2016) and for reading-to-write tasks (Chan et al., Reference Chan, Inoue and Taylor2015). Studies that report newly developed scales also discuss their validation procedures (Ramineni, Reference Ramineni2013). For instance, the findings of Saxton et al. (Reference Saxton, Belanger and Becker2012) specifically reference inter- and intra-rater reliability as a measure to ensure consistency, which is particularly hard to achieve for a newly implemented rubric.
A well-established rubric, if put to use in a local context, might not work well in distinguishing proficiency levels specific to the students or curriculum. The need to revise a rating scale in accordance with the demands of context is highlighted in Janssen et al. (Reference Janssen, Meier and Trace2015) and Banerjee et al. (Reference Banerjee, Yan, Chapman and Elliott2015). These studies emphasize the importance of following both theoretical and empirical criteria in constructing a rating scale appropriate for specific contexts.
In L2 writing assessment, human raters have long been recognized as potentially impacting reliability through differential application of the rating scale and individual interpretations. Barkaoui (Reference Barkaoui2010) found that raters attended selectively to language, rhetoric, and ideas when using a holistic scale, whereas an analytic scale led them to attend to all the listed criteria. However, a later study (Winke & Lim, Reference Winke and Lim2015) reported that raters’ attention was not equally distributed across categories when using an analytic scale; in fact, it was impacted by the sequence in which the criteria were organized in the rubric.
While rater training is widely accepted as playing a pivotal role in securing reliable ratings of essays (Weigle, Reference Weigle1998), several research studies carried out over the past decade have yielded mixed evidence as to its effectiveness. Implementing a short-term training program, Attali (Reference Attali2016) compared the psychometric properties of new and experienced raters’ scores in assessing writing, such as reliability estimates derived from G-theory and confirmatory factor analysis. Based on the finding that there were only small differences between the two rater groups in terms of scoring consistency and validity, he suggested that rating is influenced by ‘learning that occurs during initial training and abilities that are acquired prior to training’ (p. 107), rather than rater experience per se. The implications of this study are powerful: rater reliability may be more contingent upon the cognitive framing established during initial training than upon the gradual accumulation of expertise. A similar finding was reported in a longitudinal study by Lim (Reference Lim2011), where novice raters who initially displayed fluctuation in rating quality fairly quickly caught up with their experienced counterparts. Evidence on the effectiveness of individualized feedback to raters, a more specific form of rater training, remains inconclusive in the literature. A Rasch analysis carried out by Knoch (Reference Knoch2011) found such feedback ineffective in reducing the bias or variability of individual raters.
Several large-scale language tests have embraced computerized scoring to achieve higher reliability while reducing the high costs associated with human rating. Recent research, however, has focused on its limitations. One inherent vulnerability of rating software is that it is not sensitive enough to detect construct-irrelevant strategies, commonly referred to as test-wiseness strategies. In Bejar et al. (Reference Bejar, Flor, Futagi and Ramineni2014), essays in which examinees substituted a portion of the text with less frequent and longer lexical choices were given higher scores by the e-rater software than essays without such manipulation. Also, the challenges associated with scoring the content or higher-order thinking that goes into the composition of essays remain unresolved in automated scoring (Attali et al., Reference Attali, Lewis and Steier2013; McCurry, Reference McCurry2010). In addition, automated essay scoring by way of artificial intelligence (AI), machine learning (ML), and natural language processing (NLP) is gaining traction in local contexts of L2 writing assessment. Hannah et al. (Reference Hannah, Jang, Shah and Gupta2023) developed an ML-supported automated system that measures young students’ writing (Grades 3 to 6) on the features of task fulfillment, organization and coherence, and vocabulary and expression. Their findings indicated (1) that the reliability between human raters and the automated system was higher than that between human raters alone, and (2) that human-AES agreement increased with grade level. A study by Sickinger et al. (Reference Sickinger, Brunfaut and Pill2025) indicated that comparative judgment, which draws on the benefits of both automated scoring and human rating, is highly reliable for both holistic and analytic rating of young learners’ essays. Potter et al. (Reference Potter, Shortt, Goldshtein and Roscoe2025) leveraged NLP to operationalize the lexical, syntactic, cohesive, and rhetorical aspects of academic language as produced in source-based argumentative writing by tenth-grade students. Local cohesion – semantic connection between sentences – was not found to have a significant effect on the holistic score of source-based writing, whereas source cohesion – semantic cohesion with the source text without verbatim copying – was estimated to have the largest effect on holistic quality.
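To make the surface-feature logic behind such vulnerabilities concrete, the toy sketch below (Python, using scikit-learn) trains a linear scorer on invented features and scores. It is not a reproduction of e-rater or any system cited above, only an illustration of why a scorer built on surface features can be inflated by test-wiseness strategies such as padding an essay with longer, rarer words.

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented training data: features per essay are
# [word count, mean word length, type-token ratio], with invented holistic scores (1-5).
X_train = np.array([
    [120, 4.1, 0.55],
    [250, 4.6, 0.48],
    [310, 5.0, 0.52],
    [180, 4.3, 0.60],
    [400, 5.2, 0.45],
])
y_train = np.array([2.0, 3.0, 4.0, 3.0, 5.0])

model = LinearRegression().fit(X_train, y_train)

# An essay padded with longer, rarer-looking words raises the surface features
# (mean word length, type-token ratio) and thus the predicted score, even if its
# content or reasoning is weak: the kind of manipulation noted in Bejar et al.
padded_essay = np.array([[260, 5.6, 0.65]])
print(f"predicted score for padded essay: {model.predict(padded_essay)[0]:.2f}")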
An increasing number of L2 writing assessment studies have employed a statistical method called generalizability theory (G-theory) to provide a more nuanced understanding of reliability. A G-theory study carried out by Gebril (Reference Gebril2010) found that the scoring reliability of an integrated reading-into-writing task is as high as that of an independent writing task. With this finding, the researcher argued for increased use of integrated performance assessment for academic purposes. Also in the context of integrated writing assessment, Ohta et al. (Reference Ohta, Plakans and Gebril2018) examined the comparability of holistic and analytic scales. They found that the analytic scale yields a higher degree of reliability, with one of the sub-scales, source use, identified as the most consistent and reliable. Han (Reference Han2019) carried out a G-theory analysis to examine the extent to which task topic, rater, and direction of interpretation (e.g. English to Chinese vs Chinese to English) contribute to measurement error in a language interpretation assessment. Bouwer et al. (Reference Bouwer, Béguin, Sanders and van den Bergh2015) incorporated genre as a potential source of variance and reported that the generalizability of writing scores is contingent on the genre selected in the assessment. Huang (Reference Huang2012) used G-theory to estimate the reliability of writing scores obtained by second language and first language English-speaking students and found systematic differences in the scores depending on the language profile of the test takers. Noting less precision and reliability for scores assigned to the L2 test takers, Huang (Reference Huang2012) directed attention to potential fairness issues inherent in assessing multilingual test takers. G-theory has potential as practical guidance for test developers and teachers in the classroom interested in understanding reliability and the impact of raters, task type, test takers, and other aspects of performance-based writing assessments.
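As a worked illustration of the G-theory logic described above (with invented scores, not data from any of the cited studies), the sketch below estimates variance components for a fully crossed persons-by-raters design and computes a generalizability coefficient for relative decisions.

import numpy as np

# Invented scores: rows = persons (test takers), columns = raters.
scores = np.array([
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 4],
    [2, 3, 3],
    [4, 4, 5],
], dtype=float)

n_p, n_r = scores.shape
grand = scores.mean()
person_means = scores.mean(axis=1)
rater_means = scores.mean(axis=0)

# Mean squares for a two-way crossed design with one observation per cell.
ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
residual = scores - person_means[:, None] - rater_means[None, :] + grand
ms_pr = np.sum(residual ** 2) / ((n_p - 1) * (n_r - 1))

# Variance components (negative estimates truncated at zero).
var_pr = ms_pr                           # person-by-rater interaction (plus error)
var_p = max((ms_p - ms_pr) / n_r, 0.0)   # persons: the object of measurement
var_r = max((ms_r - ms_pr) / n_p, 0.0)   # raters

# Generalizability coefficient for relative decisions with n_r raters.
g_coefficient = var_p / (var_p + var_pr / n_r)
print(f"var(person)={var_p:.3f}, var(rater)={var_r:.3f}, var(p x r, error)={var_pr:.3f}")
print(f"G coefficient (relative, {n_r} raters) = {g_coefficient:.3f}")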
Our discussion of rating and scoring has centered on securing validity and reliability. Research in the past decade has provided an increasing number of rubrics that reflect the multidimensionality of L2 writing by incorporating aspects such as voice and critical thinking. Those rubrics have great potential for classroom use. Ecological validity should be of utmost concern to teachers who would like to use a well-known, established rubric: without consideration of contextual idiosyncrasies and limitations, a rubric, however sound its psychometric properties, might not function well in a different context. Thus, it may be even more important to recognize that consistency in scoring writing can be impacted by the format of a rubric (analytic vs holistic) and by experience with the rubric and the level of training in using it. These findings support practices such as teachers scoring student writing together across courses and ongoing professional development on using rubrics and rating writing. Reliability has also been addressed in terms of interrater consistency and the use of generalizability theory as a means of identifying the amounts of variance contributed by multiple sources. We suggest that, although automated scoring might help decrease variance in grading, it still lacks a mechanism to examine higher-order skills, such as critical thinking, content representation, and organization.
5.2.1. Fairness, justice, and criticality in rating and scoring research
In research on rating and scoring, attention to reliability has the potential to support fairness, as it should create equal opportunities for all test takers if everyone is scored in the same way. However, a definition of language proficiency is embedded in the scoring scales we use to establish levels of writing proficiency. These definitions and ‘standards’ lead to critical questions, including: Whose language varieties or language development patterns are these levels based on? Do they represent or prioritize one community’s language variety over others? Researchers have investigated expanding rubrics to include criteria such as authorial voice (C. G. Zhao, Reference Zhao2013) or critical thinking (Saxton et al., Reference Saxton, Belanger and Becker2012), which may push raters beyond traditional language standards; yet these too are not neutral and are imbued with cultural and linguistic values. Research on rating and scoring has not closely scrutinized the potential for racial and linguistic bias and discrimination underlying scoring or rating scales. Scoring scales have been treated as devoid of human influence, largely taken as static documents, but this assumption reveals an inherent misunderstanding of how scales, standards, and rubrics are designed and enacted. Studies of rubrics-in-context provide a pathway to understanding the impact of rubrics and the benefits of their being dynamic and responsive (Banerjee et al., Reference Banerjee, Yan, Chapman and Elliott2015; Janssen et al., Reference Janssen, Meier and Trace2015).
Standards and criteria for evaluation are related to another issue in need of more criticality: the bias of raters and teachers. Shohamy points to this issue in discussing the interpretation that comes with scoring, where critical language testing ‘considers the meaning of language test scores, the degree to which they are prescriptive, final and absolute, and the extent to which they are open for discussion and interpretation’ (Reference Shohamy1998, p. 333). Studies have found substantial evidence of rater bias in writing assessment, even with training to minimize its influence. It is a recognized weakness of performance assessments (Barkaoui, Reference Barkaoui2010; Winke & Lim, Reference Winke and Lim2015), although researchers seek to mitigate this problem (Attali, Reference Attali2016; Knoch, Reference Knoch2011). A critical lens has not been used to deeply interrogate and understand the power raters have in affecting test outcomes and the potential inequities created by the bias that raters bring to assessment.
One might assume that with automated scoring of writing performances, which has received increased use and attention over the past 20 years (e.g. Attali et al., Reference Attali, Lewis and Steier2013; Bejar et al., Reference Bejar, Flor, Futagi and Ramineni2014), biases and discrimination would be minimized. Not so. The programming involved in establishing automated scoring depends both on scoring scale characteristics (standards) and on human scorers to train and calibrate the system. The same biases held by human raters, as well as dominant ‘native’ norms, may be programmed into the automated scoring. Automated scoring may only mechanize the inequity issues that reside in scoring scales and human raters.
The bias teachers bring to reading student work is another area that would benefit from a critical approach. Ferris et al. (Reference Ferris, Brown, Liu and Stine2011) and Ferris (Reference Ferris2014) show the potential of looking at teachers’ philosophies and their actions. This research has identified teachers who adjust feedback to student needs or who seek to give agency to students. This work is a promising direction for writing assessment research. The impact of differences in teachers’ experience and knowledge on scoring reliability is important, but equally important are the values and power issues embedded in this knowledge and these processes. These are potential areas for more work.
5.3. Integrated writing assessment
Over the past two decades, L2 writing assessment research has paid increasing attention to integrating skills in assessment. These tasks require test takers to draw on two or more modalities in a writing task, such as reading-to-write or reading-listening-writing tasks. Three points are often used to justify the increasing popularity of integrated assessment. First, integrated assessment features a high degree of authenticity (Gebril & Plakans, Reference Gebril and Plakans2014; Huang, Reference Huang2012; Plakans, Reference Plakans2009) since the integration of language skills approximates the way language is used in real life. For instance, test takers who listen to and read academic passages for writing in integrated tasks face demands similar to those of academic writing contexts. Second, integrated assessment tasks are believed to have a facilitative impact on the teaching and learning of language in the classroom, hence positive washback: learners and teachers focus on improving language skills with a balanced approach in mind. Lastly, students are less likely to need to draw on background knowledge if given the same reading and/or listening source passages to work on.
Research has explored the relationship between independent (i.e. writing only) and integrated skill writing, showing some similarities as well as differences. A G-theory study carried out by Gebril (Reference Gebril2009, Reference Gebril2010) yielded similar reliability indices for both types of writing tasks. In a similar vein, in an L2 writing quality model formulated by Kim and Crossley (Reference Kim and Crossley2018), the lexical, syntactic, and cohesive features of independent and integrated writing did not demonstrate significant differences in internal factor structure. While the findings of Gebril (Reference Gebril2009, Reference Gebril2010) and Kim and Crossley (Reference Kim and Crossley2018) provided convergent evidence that independent and integrated tasks are similar in some respects, other studies highlight differences between the two task types in terms of cohesive devices used (Tywoniw & Crossley, Reference Tywoniw and Crossley2019), phraseological complexity (Zhang & Ouyang, Reference Zhang and Ouyang2023), and fluency (Michel et al., Reference Michel, Révész, Lu, Kourtali, Lee and Borges2020). These studies provide evidence that important information is gained from writing-only tasks, but also from integrated tasks that incorporate additional skills and source materials.
Investigation into the processes adopted by test takers has emerged as an important aspect of test validation; a valid test should elicit cognitive processes and test strategies as intended (Xu & Wu, Reference Xu and Wu2012). Integrated assessment has also been researched from the perspective of test-taking strategies (Cohen, Reference Cohen, Long and Richards1998). Yang and Plakans (Reference Yang and Plakans2012) found that, in an integrated reading-listening-writing task, test takers employ various strategies to integrate multiple modalities. Their findings validated a multifaceted model of integrated writing strategy use, consisting of positive contributions from discourse synthesis (Asencion, Reference Asencion2008; Spivey, Reference Spivey1984) and self-regulating strategies (Xie, Reference Xie2015), as well as negative contributions from test-wiseness strategies. Furthermore, a process-oriented study by Michel et al. (Reference Michel, Révész, Lu, Kourtali, Lee and Borges2020) found that test takers demonstrated the highest lexical productivity during the first and last stages of testing time in the integrated task, whereas fluency (e.g. mean active writing time) in the independent task was evenly distributed throughout all stages of writing. Therefore, integrated tasks are distinctly different from writing-only tasks, not only in terms of written products, but also in the underlying composing processes.
Caution with integrated assessment persists despite its perceived advantages. In terms of construct validity, the psychometric structure of integrated writing has not been fully understood (Knoch & Sitajalabhorn, Reference Knoch and Sitajalabhorn2013). As noted by Cumming (Reference Cumming2013), integrated assessment carries the danger of confounding the measurement of writing with the ability to comprehend the source passage. Indeed, researchers diverge on the role of reading in one’s performance of integrated writing (e.g. Asencion, Reference Asencion2008; Trites & McGroarty, Reference Trites and McGroarty2005; Watanabe, Reference Watanabe2001).
Concerns have also been expressed with regard to the appropriate use of sources, an integral aspect of integrated writing, and research has consequently followed (Hyland, Reference Hyland2009; Plakans, Reference Plakans2009). Defining inappropriate source use as a construct-irrelevant factor in integrated writing, Yang & Plakans (Reference Yang and Plakans2012) identified a negative, direct impact of verbatim source use on test scores. In a similar vein, Uludag et al. (Reference Uludag, Lindberg, McDonough and Payant2019) reported that appropriate source use, operationalized as the number of ideas incorporated from the source text and accurate content representation, had positive correlations with integrated writing performance. Furthermore, analyzing integrated writing performance involving both reading and listening, Plakans and Gebril (Reference Plakans and Gebril2013) reported that higher-rated essays were characterized by proper use of the listening passage with the inclusion of important ideas, whereas relatively lower-rated essays displayed heavy reliance on the reading passage, copying words and phrases from it. This finding has been echoed by Kyle (Reference Kyle2020), who employed natural language processing to automate the annotation of verbatim source use originating from the reading and listening passages. However, Weigle & Parker (Reference Weigle and Parker2012) argued that extensive borrowing from the source text is not a cause for concern, based on their finding that less than ten percent of the study participants exhibited verbatim use of the source texts. Lee et al. (Reference Lee, Liao, Hsiao, Park and Ye2025) fitted a series of generalized linear models that estimated a negative relationship between verbatim source use and organizational features. Based on this finding, they suggested that instructional focus on organization skills (e.g. authorial voice, development of ideas, coherence, organization) might help reduce reliance on verbatim source use.
This area of research holds useful applications for writing assessment in teaching and learning, although the studies have not been centered in classroom contexts. For teaching academic writing, which generally requires other language skills (i.e. reading and listening), instruction in integrating source material into writing holds potential. Using integrated assessment can provide feedback to students on their strengths and areas for improvement, supporting their development as academic writers. Research shows that these tasks have a capacity to measure writing similar to that of tasks using one skill (independent writing). The underlying skills needed for integrated writing have been found to be complex and to go beyond simply avoiding verbatim source use; they require a capacity for discourse synthesis, viewpoint recognition, and connecting one’s own argument to the source. Recent rubrics designed in research could facilitate both the teaching of these complex processes and the provision of focused feedback to students in these areas.
5.3.1. Fairness, justice, and criticality in integrated writing assessment
Integrated assessment tasks show potential to increase fairness and justice in writing assessment. Research has provided considerable evidence for the validity and reliability of these tasks (Cumming et al., Reference Cumming, Kantor, Baba, Erdosy, Eouanzoui and James2005; Gebril, Reference Gebril2009, Reference Gebril2010; Kim & Crossley, Reference Kim and Crossley2018). However, with a critical lens, new concerns emerge that should be scrutinized. Firstly, providing all test takers with the same content, through reading source material, is intended to diminish advantages such as background knowledge or familiarity with genre (Cumming, Reference Cumming2013). Process studies have revealed that writers draw on reading, writing, and the combination of these skills when composing these tasks (Xu & Wu, Reference Xu and Wu2012; Yang & Plakans, Reference Yang and Plakans2012). This could make these tasks accessible and fair. However, critical questions remain regarding whether this is indeed the case. Background knowledge has an impact on reading, not just on writing. Thus, if a student or test taker engages with the task and draws on knowledge about the topic for both the source text and the writing, they would still retain a potential advantage over a test taker without such background. Research should be undertaken to explore this effect in such writing tasks.
A second advantage of these tasks is their attention to context (Huang, Reference Huang2012; Plakans, Reference Plakans2009). Particularly in academic settings, writing is not done in isolation but with content drawn from reading or listening. Therefore, the task addresses the realities of language use. This ‘ecological validity’ can increase fairness. However, a lens of criticality might question whether the context of academic writing is itself perpetuating inequity and upholding a dominant variety of writing, that is, one that depends on and cites others’ ideas (Hyland, Reference Hyland2009). In fact, there are many ways that different languages, varieties of languages, and cultures incorporate intertextuality, the concept that all texts are shaped by other texts. An English or ‘Western’ dominant approach is to cite as an acknowledgement when incorporating ideas or wording published by others. Digging more deeply into intertextuality (Baron, Reference Baron2019), some might question whether anyone truly ‘owns’ words or ideas in today’s society, where we are constantly reading, talking, and digesting the words and ideas of others. Who is the owner? Who is the borrower? Is there a power dimension to this distinction? These questions also recall issues from the previous section’s critique of rating and scoring, namely critical questions about standards and bias.
As researchers of integrated writing assessment continue to study these tasks, a critical lens should inform their work on defining the construct, understanding the relationship between reading and writing, developing rubrics and tasks, and exploring underlying values around the ownership of language.
5.4. Teachers’ and learners’ knowledge, beliefs, and perspectives on writing assessment
Research on teachers’ and learners’ perspectives on writing assessment is integrally linked to classroom use of assessments. Studies in both L1 and L2 writing research have illustrated the importance of building teachers’ skills in assessing writing to support effective writing instruction. Crusan et al. (Reference Crusan, Plakans and Gebril2016) investigated the landscape of second language writing teachers’ assessment literacy in terms of knowledge, beliefs, and practices. The majority of respondents to a survey reported learning the theoretical and practical fundamentals of L2 writing assessment through coursework during graduate studies; however, about one-fifth reported they had received no formal education in teaching or assessing writing. Some widely used assessment methods, such as portfolio assessment, self-assessment, and scoring rubrics, were favorably received by the majority of the participants and extensively used in their classroom contexts. Noting that L2 writing teachers’ beliefs have not been studied extensively, Karaca & Uysal (Reference Karaca and Uysal2021) developed and validated a scale dedicated to measuring L2 writing teachers’ beliefs about the teaching and assessment of L2 writing. The strongest belief that the teachers held was about reader-centered writing, closely followed by the important role of motivation in L2 writing. The weakest belief pertained to the role of L1 (i.e. transfer) and discourse-level competency in English writing.
To be self-efficacious, a teacher needs to understand what goes into the preparation of quality writing instruction and how to translate that understanding into practice. Locke and Johnston (Reference Locke and Johnston2016) defined these for L2 writing teachers as pre-writing instruction strategies and compositional strategy demonstration. They argued for the pivotal role these play in enhancing writing instruction and learning outcomes. For instance, in Dempsey et al. (Reference Dempsey, PytlikZillig and Bruning2009), L1 writing teachers who were aided by an online-based assessment tool in making informed use of an analytic rubric witnessed significant improvement in their ability to assess student writing. They also reported a heightened degree of self-efficacy in grading learners’ writing.
Research has shown that teachers hold a variety of ideas about what constitutes good feedback practice, reflected in the varying foci of their commentary on students’ writing. For instance, in Marefat and Heydari (Reference Marefat and Heydari2016), the evaluations by ‘native-English-speaking’ teachers of essays written by Iranian college students were found to be more consistent and reliable, but these teachers tended to be stricter in rating organization. In contrast, ‘non-native-English-speaking’ teachers’ attention was focused more on grammar than on organization. An extensive dataset was analyzed by Dixon and Moxley (Reference Dixon and Moxley2013), involving 118,611 comments made on 17,433 essays written by L1 first-year college students. Instructors primarily drew attention to rhetorical concerns in these essays, such as organization and use of evidence to support arguments. Lower-order concerns of grammar or formatting were pointed out to a lesser degree, reflecting an emphasis on discourse structure in L1 writing.
A study by Beck et al. (Reference Beck, Llosa, Black and Anderson2018) explored the ways in which L2 writing teachers established instructional priorities in response to writing produced by a sample of ninth and tenth grade students. It was found that, in their instructional practice, teachers prioritized readily teachable notions, such as structure, rather than the more abstract cognitive processes that go into writing. In particular, teachers expressed uncertainty and even discomfort about addressing ‘knowledge-transformation’ (Bereiter & Scardamalia, Reference Bereiter and Scardamalia1987) aspects of writing, a process in which a writer produces a novel idea from what is presented in the source texts.
To explore teachers’ beliefs and perspectives, Ferris (Reference Ferris2014) set out to identify the philosophies that teachers uphold as they provide feedback to L2 student writers, and the degree to which their philosophies align with practice in their writing classrooms. She found that currently accepted feedback methods, such as peer review and student-teacher conferences, were favored by teachers and adopted in line with their philosophies of empowering learners as competent writers. A study with a similar focus by Ferris et al. (Reference Ferris, Brown, Liu and Stine2011) reported that the teachers were consciously adapting their approach to feedback according to the perceived needs of the students, from explicit attention to grammatical errors to directing students to seek support from external resources, such as a writing center.
Despite these positive findings, research has also shown that the philosophies teachers express about giving feedback do not always match their practices. Discrepancies have been found between what teachers know or believe they are doing and what they do in the classroom. For example, Li & Barnard (Reference Li and Barnard2011) conducted qualitative interviews with writing tutors working in a university in New Zealand. Despite the tutors’ commitment to writing improvement, the findings suggested that they focused on feedback that justified the grades assigned to students rather than feedback that supported the learners’ writing development.
An in-depth investigation into the misalignment between beliefs and practices was conducted by Mao and Crosthwaite (Reference Mao and Crosthwaite2019), who examined three areas of discrepancy between teachers’ stated beliefs and their feedback practices. Most teachers stated that they provide more direct than indirect feedback, while, in practice, the opposite appeared to be the case. The teachers also underestimated the number of corrections they provided on local issues, while overestimating the number provided on global issues. Niu et al. (Reference Niu, Shan and You2021) compared feedback on essays written by Chinese EFL learners provided by Chinese EFL writing teachers, Chinese peers, and American students. Feedback by the teachers focused more locally on form, whereas the student groups’ feedback was more oriented towards meaning. In this study, American students’ feedback was observed to bring the highest rates of successful uptake, in contrast to previous studies that showed greater uptake of teachers’ suggestions.
Research on L2 writing assessment has not paid as much attention to the perspectives and beliefs of L2 test takers. Rather, researchers’ attention has primarily shone a spotlight on written products and their textual features. In some studies, however, learners’ perspectives and beliefs about L2 writing assessment have started to be addressed in innovative ways. For example, Aydin (Reference Aydin2010) studied 204 learners’ perceptions of portfolio assessment through a questionnaire. The students perceived that their proficiency and knowledge improved with a writing portfolio, but that some of the processes and procedures in creating one were problematic. In another student-focused study, Kim (Reference Kim2017) investigated Korean EFL students’ strategies and challenges with the TOEFL writing section, leading to the conclusion that the score from the TOEFL writing section is not an accurate representation of what they know about writing. Kim called for more critical approaches by highlighting test takers’ voices about writing tests. Vincelette and Bostic (Reference Vincelette and Bostic2013) centered the perceptions of students and instructors on screencast assessment – an innovative approach to giving feedback that allows for audio and video responses to writing. Students expressed a strong preference for this method of feedback. While it did not reduce the time instructors spent giving feedback, their comments were more focused on macro-level issues in writing.
Research studies dedicated to uncovering test takers’ and learners’ perceptions of tests contribute to validity evidence in L2 writing assessment studies. Xie (Reference Xie2015) studied what test takers perceived as critical in their writing to secure high scores on a writing assessment. The researcher administered a survey to 886 college-level Chinese EFL learners and ran an exploratory factor analysis, identifying two different factors underlying the learners’ responses. The first factor was the perception that a test taker should use words that are sophisticated and less common as well as more complex sentence structures. The second factor, which was somewhat in conflict with the first, was ‘avoiding penalties from raters’ by reducing the use of unfamiliar lexical and syntactic choices. Taking the latter, perhaps more reactive, approach was found to be more strongly associated with higher scores.
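For readers less familiar with the method, the sketch below illustrates, with invented data, how a two-factor exploratory analysis of Likert-type survey responses can be run and the item loadings inspected. It is a generic illustration of the technique only; the sample size, number of items, simulated responses, and choice of library are our assumptions and do not reproduce Xie’s (Reference Xie2015) analysis.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Invented Likert-type responses (1-5) from 300 respondents to 8 survey items.
rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(300, 8)).astype(float)

# Fit a two-factor model to the item responses.
fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(responses)

# Rows = items, columns = factors; larger absolute loadings suggest which latent
# factor (e.g. 'using sophisticated language' vs. 'avoiding rater penalties') an item reflects.
loadings = fa.components_.T
print(np.round(loadings, 2))
```

In an actual study, such loadings would be interpreted alongside item wording and reliability evidence before drawing conclusions about learners’ test-taking orientations.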
In summary, research has revealed complex perspectives of teachers and learners concerning writing assessment in terms of understanding assessment concepts, test-taking strategies, and attitudes. The implications of this research point to the overall importance of teachers’ writing assessment literacy, which influences their practices and priorities in writing instruction. Further, research has suggested areas for focus within such literacy, such as understanding the complex and abstract aspects of writing processes and performances. An area in need of further research and scrutiny is students’ lived experiences with L2 writing tests and the role that the tests play in mediating opportunities and access, such as success in higher education or employment opportunities (Hamid et al., Reference Hamid, Hardy and Reyes2019). Addressing these questions will broaden our understanding of the role and effect of L2 writing tests in teaching and in shaping the lives of language learners.
5.4.1. Research on fairness, justice, and criticality in teachers’ and students’ beliefs
In research on perspectives, beliefs, and knowledge, teachers have not been asked about issues of fairness, equity, or even ethics in their knowledge of writing assessment in classroom contexts. Are they prepared to recognize and address fairness and justice issues within their practices of assessing writing? Humanizing approaches to research incorporate work with teachers’ reflections on their role in systems of inequity as well as on their own experiences of discrimination. In addition to including a critical lens in teachers’ professional learning, the field should consider whether access to assessment literacy is equitable. Are some teachers (dis)advantaged based on systems, context, or cultural background in receiving preparation in writing assessment? In Crusan et al. (Reference Crusan, Plakans and Gebril2016), teachers who were not ‘native’ English speakers were, in fact, more confident and used more assessment strategies. Others found this group of teachers more consistent and reliable in feedback (Marefat & Heydari, Reference Marefat and Heydari2016). These findings contradict the value placed on ‘native-language’ ability, which should not be a proxy for language teacher quality or assessment literacy; in fact, there is counter-evidence to this damaging power dynamic in our field.
Other stakeholders also need more attention if fairness and justice are to be addressed in writing assessment research. Based on our review, the perspectives of learners seem to be a largely under-researched area. However, an encouraging number of researchers have embarked on filling this research gap (Aydin, Reference Aydin2010; Kim, Reference Kim2017; Vincelette & Bostic, Reference Vincelette and Bostic2013; Xie, Reference Xie2015), although more is needed. These voices are important for understanding how the enactment of explicit or hidden standards marginalizes students through assessment practices. More work to highlight what students bring to writing classrooms in terms of experience and knowledge of writing and writing assessment could uncover important findings related to fairness. Their backgrounds are critical to their learning from assessments: How do they take up feedback from teachers, and does the feedback recognize their humanity and empower them? How does the assessment support their identities as writers? How do the interactions around assessment support or disenfranchise learners? These questions should be considered for multilingual writers whose writing may be overwhelmed with error correction. The field should consider how feedback can be effective in improving writing while supporting and encouraging writers.
As with the research on writing assessment in general, a focus on native and non-native teachers of writing emerges in the research on teachers. This is a problematic dichotomy that we must stop relying on to describe teachers or raters, because it undervalues their multilingualism and perpetuates an unequal power dynamic. Noticeably, very little attention has been paid to multilingualism in writing assessment for learning, as part of a teacher’s profile, and in classroom contexts.
5.5. Feedback in the writing classroom
Writing assessment in classrooms appears most frequently when teachers provide feedback on students’ written work. With this interaction, assessment becomes almost ubiquitous in writing instruction. Its pervasiveness has resulted in a steady line of research to understand teachers’ feedback. Corrective feedback (CF) has been an area of attention in writing research for some time, and it can be considered within the context of assessment, teaching, and learning. The previous section described teachers’ beliefs and perspectives on feedback, which contributes to the discussion of CF. However, research into responding to L2 students’ writing in the classroom has largely concerned itself with identifying the nature of CF by classifying it into different types and comparing their effectiveness in improving learners’ subsequent writing. For instance, focused CF, directed toward specific features of writing, has been identified as more effective than comprehensive but less focused CF. This finding has been substantiated in a study by Bitchener & Knoch (Reference Bitchener and Knoch2010), who argued for the effectiveness of focused CF based on its benefits for advanced-level ESL learners in improving their accuracy of article usage, a notoriously challenging aspect of English. Research in the past decade into CF has become more nuanced, with several variations of feedback taking hold as classroom practices, such as dynamic written corrective feedback (Kurzer, Reference Kurzer2018). In this mode of feedback, teachers provide specific, targeted, and, most importantly, interactive individualized feedback with a view to helping learners become independent in monitoring and self-editing their written work.
A common concern held by L2 writing teachers in EFL contexts is the possibility that language learners might find processing CF challenging. Zheng & Yu (Reference Zheng and Yu2018) studied the affective, behavioral, and cognitive engagement of lower-proficiency Chinese EFL learners in addressing teacher-written CF. While their affective response to corrective feedback was largely positive, the cognitive and behavioral dimensions did not come up to par with the affective dimension, with little improvement observed in writing accuracy. Waller & Papi (Reference Waller and Papi2017) examined the role of language learners’ implicit theories of writing intelligence in accepting written corrective feedback (WCF) from others. Implicit theories of intelligence, namely the set of beliefs that a learner holds as to the fixedness or malleability of their intelligence, have been proposed to significantly affect motivation. In the study, a questionnaire measuring learners’ attitudes towards WCF was analyzed, revealing two major dimensions: a feedback-seeking orientation and a feedback-avoiding orientation. Students who harbored an incremental theory of writing intelligence were more likely to pursue feedback, while those who held a fixed mindset were less open to feedback.
While writing teachers work to provide meaningful feedback, learners are included in this process through peer feedback. Scholarship presents nuanced findings on the benefits of peer feedback. For example, H. Zhao (Reference Zhao2010), in a comparative study, examined learners’ use of peer feedback and teacher feedback. The study revealed that, while learners incorporated more teacher feedback in their redrafts than peer feedback, they did so with less understanding of its meaning. Weng et al. (Reference Weng, Zhao and Chen2024) carried out an experimental study to compare the writing scores of two groups which differed by who provided feedback: peers or the teacher. Relative to the control group, students in the experimental condition displayed significant improvement in appreciation of feedback and appraisal of information. Furthermore, Peng (Reference Peng2024) investigated the differential effects of individual and collaborative processing of teacher feedback on writing development, finding that collaborative processing is associated with greater grammatical accuracy. A case study conducted by Yu & Hu (Reference Yu and Hu2017) revealed that feedback-giving practice is not uniform across learners but differs according to individual factors in relation to the immediate sociocultural context. In their case study, two students were found to focus on different aspects of writing as they provided feedback to peers. One participant paid particular attention to surface-level language in writing, such as lexical choices and grammar, while the other pointed out more global issues of writing such as idea development or content. He et al. (Reference He, Xia, Zhang and Liu2025) suggested that ongoing peer review training deepened Chinese EFL learners’ cognitive engagement with peer feedback, revealing asynchronous development between noticing writing problems and understanding peer feedback. This study also demonstrated that the quality of peer feedback improved significantly over the period of training, highlighting the importance of instructing learners on how to provide quality feedback on a peer’s writing.
With the advent and development of artificial intelligence, the provision of feedback has also become computerized. Dikli & Bleyle (Reference Dikli and Bleyle2014) compared the perceptions of advanced ESL students about automated feedback and instructor feedback on grammar, usage, and mechanics. The researchers concluded that the students had a sense of trust in the quality of feedback from the automated system, and their positive attitudes towards automated feedback were amplified when it was provided in conjunction with instructor feedback. Xiaosa & Ping (Reference Xiaosa and Ping2025) longitudinally examined the affective, behavioral, and cognitive responses of three English language learners to automated feedback, identifying both intra- and inter-individual differences. Engagement with automated feedback depended crucially on the purpose of language learning, a preference for certain language skills over others, the student’s language proficiency, and affective factors.
Feedback is the quintessential assessment instrument in teaching writing. The research in this area provides direct connections to teaching and learning in writing instruction. As discussed in the previous section on teacher beliefs and knowledge, the alignment of the teacher’s philosophy with student needs is important, including their beliefs about feedback (i.e. method, directness, and focus). This alignment is a critical concern, as teachers’ philosophies should also align with the overall course and thus result in useful feedback and valid assessment. Secondly, corrective feedback has been a major area of interest to the field. Research is fairly conclusive that CF is better when focused rather than comprehensive. It also suggests that CF does not operate independently of context. Interaction in dynamic assessment can facilitate the usefulness of CF. Furthermore, student responsiveness may impact CF, so attention to how students use it and whether more support is needed would improve uptake. The majority of studies are cross-sectional (Bitchener, Reference Bitchener2012), spanning a relatively short time for data collection and analysis. Considering that language development promoted by feedback requires a long-term developmental trajectory, evidence from more longitudinal studies is important.
Lastly, peer feedback has been an area of focus in the past decade, revealing a multitude of factors impacting its success, including the human element in rating/scoring discussed in the previous section. Having students in the role of giving feedback is also impacted by bias and relationships with peers. Providing ample training and ongoing support in peer feedback could help minimize these issues.
5.5.1. Fairness, justice, and criticality in feedback research
Similar to the critique of research on features of writing performance, language ‘standards’ appear in research on feedback both explicitly and implicitly. Since feedback is both assessment and instruction, the approach a teacher takes will both shape and dominate a classroom. Feedback can support or undermine opportunities to learn, empower, and promote equity in writing instruction. Research has focused on effectiveness or preferences in feedback but has not directly taken on the important role that feedback has in assessment and equity. Critical questions include: Does feedback, from teachers or peers, give students equal opportunities to learn and develop their writing? Do teachers include feedback that values students and empowers them as writers? The attention to corrective feedback draws the focus of assessment to correctness and grammatical accuracy (Bitchener & Knoch, Reference Bitchener and Knoch2010), which are based on dominant norms of language standards. In teaching writing, scholars and instructors recognize there is more to writing than lexico-grammatical accuracy, but it remains unclear how accuracy is defined in relation to different varieties of a language.
Another critical aspect of feedback, mentioned in previous sections, is the human factor. Teachers bring prior experience and bias that impact their feedback and expectations of students. For example, their focus may be on grading rather than learning (Li & Barnard, Reference Li and Barnard2011) or empowerment. The impact of the teacher on feedback is not just an issue of reliability; it can also affect fairness in the classroom and justice in terms of access in a learner’s future. Strategies have been proposed over the years to ameliorate inequity, such as giving feedback without student names on the submission. However, this is not a problem that is so easily resolved. Rubrics are another attempt to quash inequity in giving feedback; however, this approach too has flaws, as mentioned in the previous section on rating. Teacher bias and rubrics are both spaces where dominant values can preside unquestioned, and both are also venues for critical work. Instructor-student negotiated rubrics create opportunities for student input on how and what feedback is given and for peers to bring shared experience to the feedback-giving process. Research on these approaches could inform the potential for feedback to disrupt inequity in writing classrooms. The research on dynamic corrective feedback (Kurzer, Reference Kurzer2018) is an example of moving feedback in new directions that could better serve students in light of fairness and justice. Another area of potential research is ‘ungrading’, which disconnects the feedback-revising loop from any form of points or grades, changing the value placed on ‘correctness’ and reframing writing improvement as intrinsically motivating in and of itself.
6. Concluding thoughts on reviewing writing assessment research through a critical lens
Based on our review of over two decades of writing assessment research, we found five major themes explored through research questions related to writing assessment in language teaching and learning: (1) features of writing performance, (2) rating and scoring, (3) integrated assessment, (4) teachers’ and learners’ knowledge, beliefs, and perspectives, and (5) feedback in the writing classroom. We summarized and highlighted examples of research in each of these areas in relation to and with implications for teaching and learning. For each theme, we provided a critique related to issues of fairness and justice through a critical lens. Table 2 summarizes findings within each theme of the review.
Table 2. Summary of the themes

Layering the framework (Fig. 1) that we proposed over the review of this research, several overarching areas and actions surfaced to enact critical approaches, which we will discuss further as action items for writing assessment researchers: (1) being more explicit about what/whose standards underlie measures of writing features and definitions of proficiency, (2) not marginalizing multilingualism, and (3) centering voices of stakeholders. While these emerged from our review of a select group of articles and journals, we recognize that work which covers these issues may be appearing in venues not explored by this review. Moving forward, we hope the field will continue to investigate and publish research in writing assessment that seeks to dismantle inequity in teaching and learning, as well as to highlight, empower, and value our students’ experiences, languages, and writing.
The current approach in researching characteristics of writing at different proficiency levels deserves a critical lens to uncover their embedded norms and values. As Shohamy (Reference Shohamy1998) states, it is necessary to interrogate ‘whose knowledge the tests are based’ on (p. 333). A critical approach requires more recognition of and transparency about the origins of characteristics such as CALF, or other interpretations of quality and the measures used therein. What defines proficiency and proficiency levels in English or any language in which L2 writing is assessed? Questions need to be asked about power, equity, and marginalization in relation to how we characterize writing performances in assessments. These are big questions, each comprising an individual research agenda. Scholars looking at sociocultural features of writing are adding depth to the traditional features that have long been the focus of L2 writing, which is promising (C. G. Zhao, Reference Zhao2013). Ignoring the dominant norms at play in selecting ‘features’ or defining ‘proficiency levels’, and not taking them up to trouble our work, perpetuates systems that subjugate multilingual writers.
Scholars in second language acquisition have articulated a ‘bilingual turn’ in the field (Ortega, Reference Ortega2013), rather than centering theories of language acquisition on monolingualism and sequential language learning (first language + second language). Multilingualism recognizes young learners who develop two or more languages concurrently and adults who use both/multiple languages in writing as resources to think and express themselves (DeCosta et al., Reference DeCosta, Li and Lee2022). This reality has been addressed in practice through approaches in bilingual education and in the evolving theory of translanguaging (García, Reference García2017; García & Otheguy, Reference García and Otheguy2020; Wei & Lin, Reference Wei and Lin2019). A special issue of Language Assessment Quarterly (2019) featured research and practice around multilingual assessment.
More research should delve into the cognitive and socially constructed aspects of multilingualism in writing. This work is foundational in building a construct for multilingual writing. To illustrate, an approach to validity integrated with fairness and justice would insist on centering multilingualism in writing assessment, rather than monolingualism, through a construct that reflects the sophisticated complexity in language use of writers with more than one variety of a language or multiple languages. Developing a generalized model of multilingualism will be difficult and imperfect, as scholars who work with the notion of translanguaging (Wei, Reference Wei2018) emphasize the individualized ways in which people use multiple languages. Therefore, building an assessment that authentically elicits and reflects the way multilinguals use multiple languages is highly complex but not insurmountable. Developing such assessments starts by defining the construct of multilingualism.
For example, rating scales in writing assessments are by design monolingual. An important and intriguing challenge for the field is to examine how scales could address multilingualism. There have been attempts to design multilingual assessments (Guzman-Orth et al., Reference Guzman-Orth, Lopez and Tolentino2019; Lopez, Reference Lopez2023), but when it comes to scoring, challenges emerge, namely achieving clarity about what the score tells us about a learner’s language and how this information will be used. To value multilingualism in writing assessments, important work is needed to understand the domains of multilingual language use. This relates to both performance and composing processes in an assessment. Critical perspectives and multilingual turns are present in language education research (DeCosta et al., Reference DeCosta, Li and Lee2022; Ortega, Reference Ortega2013), and, thus, writing assessments can draw on this scholarship.
In reviewing writing assessment research, students’ voices rarely appear in studies. Engaging more with these stakeholders in research would be a step toward acknowledging power structures in assessment. It is promising to see research that includes the voices of test takers, like that of Xie (Reference Xie2015), who sought to understand their perceptions of writing assessment. Research that attends to both teachers’ voices and agency, as well as scholarship that recognizes and illuminates teacher (and rater) biases, should continue to be undertaken in the field. Related to elevating their voices, how we group and characterize teachers and students also needs attention in writing assessment. For example, the dichotomy of ‘native’ versus ‘non-native’ has been repeatedly questioned, and yet it still routinely appears in the journals reviewed in this time frame. The use of a binary (native/non-native) excludes or diminishes a more common way of languaging by multilinguals. Writing assessment research needs to cease comparing teachers, raters, or students with the imprecise and discriminatory dichotomy of nativeness. Again, pushing back on this is not novel, yet the dichotomy persists in research questions as a way to categorize writers and teachers. Changing this default in our field will require major rethinking, and yet the reality of multilingualism is not new and is more prevalent than monolingualism. For example, reframing the ‘ownership’ of English within a Global English approach moves away from the territoriality of language. Shohamy (Reference Shohamy1998) recognizes that criticality must question complex issues like this in our assessments, as CLT ‘perceives of language testing as being caught up in an array of questions concerning educational and social systems’ (p. 333).
As we conclude, we point out that, as researchers, we include ourselves in this critique. In much of our past work, especially with integrated writing assessment research, we have not addressed our research through a critical lens. We are working to learn more and consider critical questions of fairness and justice in our research. If this State-of-the-Art article appears as finger pointing, then it is indeed directly pointed at ourselves. Vulnerability and self-reflection should be ongoing in critical work as researchers, providing us with new insights and awareness to inform our work.
In conclusion, the field of writing assessment needs to address questions of fairness, justice, and criticality in our research and practice. In his book, Antiracist writing assessment ecologies: Teaching and assessing writing for a socially just future, Inoue (Reference Inoue2015) states:
Understanding classroom writing assessment as an ecology that can be designed and cultivated shows that the assessment of writing is not simply a decision about whether to use a portfolio or not, or what rubric to use. It is about cultivating and nurturing complex systems that are centrally about sustaining fairness and diverse complexity (p. 12).
This is essential in providing equity for multilingual language learners in schools and classrooms. A critical perspective includes acknowledging that power dimensions intersect and underwrite educational policy, assessments, and instruction (Flores & Rosa, Reference Flores and Rosa2015; Kubota, Reference Kubota2020).
Writing assessments have the potential to value multiple languages and multilingualism as well as provide opportunities; however, they can also have the opposite effect, resulting in denied access. In our review of research over the past two decades, the field of L2 writing assessment appears to lack a strong critical stance in research. We need to unpack how our writing assessments are perpetuating hegemony and hierarchies in order to dismantle systems that suppress rather than uplift language learners. The scholars in language testing calling for critical language testing (Lynch, Reference Lynch2001; Shohamy, Reference Shohamy1998, Reference Shohamy2001) and taking social perspectives (McNamara & Roever, Reference McNamara and Roever2006; Roever & Wigglesworth, Reference Roever and Wigglesworth2019; Schissel & Khan, Reference Schissel and Khan2021) have provided directions for this work, which could be extended by drawing on theory such as CritLing and critical discourse analysis (Catalano & Waugh, Reference Catalano and Waugh2020), critical language awareness (Britton & Leonard, Reference Britton and Leonard2020), critical race theory (Delgado & Stanfincic, Reference Delgado and Stanfincic2017) or raciolinguistics (Degollado, Reference Degollado2019; Flores & Rosa, Reference Flores and Rosa2015). These approaches can illuminate and respond to power, linguicism, racism, and inequity in second language writing assessment.
7. Questions arising
In this review, we identified established themes in writing assessment research published over the past 20 years in a select group of journals in the field. Through this work we learned about the writing of language learners and their teachers as well as the instruments and characteristics of writing in assessments. Reviewing this research through a framework of fairness and justice (Fig. 1), with attention to critical inquiry, we find considerable room for work to address inequities, power dimensions, and marginalized voices in our field. Research plays an important role in uncovering and unpacking assumptions. It is also necessary in rebuilding and identifying pathways forward. From our review of the literature and our critique, we propose a list of questions for future research to consider. These questions are a potential starting point and not a comprehensive list. Some directly address individual themes in the review, while others can be applied to research across several themes.
• Whose language is the ‘standard’ for language proficiency in writing assessment? Why?
• What domains are privileged in our writing assessments?
• What practices can best illuminate rater and test user bias in writing assessment?
• How can teachers reflect on the role of assessment in marginalizing students’ languages?
• How can writing assessment avoid prescriptive approaches to teaching and learning?
• What constructs capture multilingual writing ability?
• How can we disrupt the false dichotomy of ‘native’ and ‘non-native’ speaker?
• What are ways to include students in writing assessment research?
• How can test development enable fair and just writing assessment?
Reflecting on and researching these questions can transform writing assessment in teaching and learning to create more equitable opportunities for learners and their languages. The questions also complicate processes and create messiness that is difficult to navigate in assessment, which has traditionally upheld precision and accuracy as necessary qualities. However, using a framework that situates validity with fairness and justice (McNamara & Ryan, Reference McNamara and Ryan2011) within the landscape of context and with attention to criticality can create space for writing assessment research to recognize and repel inequity and to value the diversity of languages and languaging of our learners (Lynch, Reference Lynch2001; Shohamy, Reference Shohamy1998, Reference Shohamy2001).
Lia Plakans is a Professor of Multilingual Education in the Department of Teaching and Learning at the University of Iowa’s College of Education. Her teaching and research focus on second language assessment, literacy, and language education, with particular emphasis on the integration of reading and writing, assessment design, and multilingual learners’ experiences. She has co-authored books including Assessment myths: Applying second language research to classroom teaching (University of Michigan Press, 2015) and Reading and writing for academic success (University of Michigan Press, 2003). In her scholarship, she seeks to contribute to equity-centered educator preparation and assessment reform. lia-plakans@uiowa.edu
Kwangmin Lee is an Assistant Professor of Teaching English to Speakers of Other Languages (TESOL) in the Department of Special Education and Literacy Studies at Western Michigan University. He works with in-service language educators across Michigan as they pursue a master’s degree in TESOL. His research centers on second and foreign language assessment, particularly integrated writing assessment and quantitative research methods. Most recently, he has been expanding his research repertoire by exploring the use of Artificial Intelligence (AI) and Machine Learning (ML) in language assessment to support task innovation. kwangmin.1.lee@wmich.edu