
Exploring the dual impact of AI in post-entry language assessment: Potentials and pitfalls

Published online by Cambridge University Press: 05 June 2025

Tiancheng Zhang*
Affiliation:
Faculty of International Studies, Southwestern University of Finance and Economics, Chengdu, China; DELNA, The University of Auckland, Auckland, New Zealand
Rosemary Erlam
Affiliation:
Faculty of Arts and Education, The University of Auckland, Auckland, New Zealand
Morena Botelho de Magalhães
Affiliation:
DELNA, The University of Auckland, Auckland, New Zealand
Corresponding author: Tiancheng Zhang; Email: tzha305@aucklanduni.ac.nz

Abstract

This paper explores the complex dynamics of using AI, particularly generative artificial intelligence (GenAI), in post-entry language assessment (PELA) at the tertiary level. Empirical data are presented from trials with the Diagnostic English Language Needs Assessment (DELNA), the University of Auckland’s PELA.

The first study examines the capability of GenAI to generate reading texts and assessment items that might be suitable for use in DELNA. A trial of this GenAI-generated academic reading assessment with a group of target participants (n = 132) further evaluates its suitability. The second study investigates the use of a fine-tuned GPT-4o model for rating DELNA writing tasks, assessing whether automated writing evaluation (AWE) provides feedback comparable in quality to that of human raters. Findings indicate that, while GenAI shows promise in generating content for reading assessments, expert evaluations reveal a need to refine question complexity and to target specific subskills more precisely. In AWE, the fine-tuned GPT-4o model aligns closely with human raters in overall scoring but requires improvement in delivering detailed and actionable feedback.
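As a concrete illustration of the human–AI scoring comparison described above, the sketch below computes quadratic weighted kappa (QWK), a standard agreement metric for ordinal scores in automated essay scoring. This is a minimal example rather than the authors’ actual analysis pipeline: the band scale and the score data are hypothetical, and scikit-learn is assumed to be available.

```python
# Minimal sketch (hypothetical data): comparing a fine-tuned model's writing
# scores with human ratings via quadratic weighted kappa (QWK).
from sklearn.metrics import cohen_kappa_score

# Hypothetical band scores for the same ten scripts; the scale is assumed
# for demonstration, not taken from DELNA documentation.
human_scores = [4, 5, 6, 4, 5, 6, 5, 4, 6, 5]
model_scores = [4, 5, 5, 4, 5, 6, 5, 5, 6, 5]

# Quadratic weighting penalizes large disagreements more heavily than
# adjacent-band ones: 1.0 = perfect agreement, 0 = chance-level agreement.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.3f}")
```

In practice, a QWK at or near the level of human–human agreement is typically taken as evidence that automated scoring is viable, though exact thresholds vary by assessment program.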

A Strengths, Weaknesses, Opportunities, and Threats (SWOT) analysis highlights AI’s potential to enhance PELA by increasing efficiency, adaptability, and personalization. AI could extend PELA’s scope to areas such as oral skills and dynamic assessment. However, academic integrity and data privacy remain critical concerns. The paper proposes a collaborative model integrating human expertise and AI in PELA, underscoring the irreplaceable value of human judgment. We also emphasize the need to establish clear guidelines for a human-centered AI approach within PELA, so as to maintain ethical standards and uphold assessment integrity.

Type
Research Article
Copyright
© The Author(s), 2025. Published by Cambridge University Press.
