
Exploring the dual impact of AI in post-entry language assessment: Potentials and pitfalls

Published online by Cambridge University Press: 05 June 2025

Tiancheng Zhang*
Affiliation:
Faculty of International Studies, Southwestern University of Finance and Economics, Chengdu, China; DELNA, The University of Auckland, Auckland, New Zealand
Rosemary Erlam
Affiliation:
Faculty of Arts and Education, The University of Auckland, Auckland, New Zealand
Morena Botelho de Magalhães
Affiliation:
DELNA, The University of Auckland, Auckland, New Zealand
Corresponding author: Tiancheng Zhang; Email: tzha305@aucklanduni.ac.nz

Abstract

This paper explores the complex dynamics of using AI, particularly generative artificial intelligence (GenAI), in post-entry language assessment (PELA) at the tertiary level. Empirical data are presented from trials with the Diagnostic English Language Needs Assessment (DELNA), the University of Auckland’s PELA.

The first study examines the capability of GenAI to generate reading texts and assessment items that might be suitable for use in DELNA. A trial of this GenAI-generated academic reading assessment with a group of target participants (n = 132) further evaluates its suitability. The second study investigates the use of a fine-tuned GPT-4o model for rating DELNA writing tasks, assessing whether automated writing evaluation (AWE) provides feedback comparable in quality to that of human raters. Findings indicate that, while GenAI shows promise in generating content for reading assessments, expert evaluations reveal a need to refine question complexity and to target specific subskills more precisely. In AWE, the fine-tuned GPT-4o model aligns closely with human raters in overall scoring but requires improvement in delivering detailed and actionable feedback.
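As a concrete illustration of the human–AI scoring comparison described above, the sketch below computes quadratic weighted kappa (QWK), a standard agreement metric for ordinal scores in automated essay scoring. This is a minimal example rather than the authors’ actual analysis pipeline: the band scale and the score data are hypothetical, and scikit-learn is assumed to be available.

```python
# Minimal sketch (hypothetical data): comparing a fine-tuned model's writing
# scores with human ratings via quadratic weighted kappa (QWK).
from sklearn.metrics import cohen_kappa_score

# Hypothetical band scores for the same ten scripts; the scale is assumed
# for demonstration, not taken from DELNA documentation.
human_scores = [4, 5, 6, 4, 5, 6, 5, 4, 6, 5]
model_scores = [4, 5, 5, 4, 5, 6, 5, 5, 6, 5]

# Quadratic weighting penalizes large disagreements more heavily than
# adjacent-band ones: 1.0 = perfect agreement, 0 = chance-level agreement.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.3f}")
```

In practice, a QWK at or near the level of human–human agreement is typically taken as evidence that automated scoring is viable, though exact thresholds vary by assessment program.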

A Strengths, Weaknesses, Opportunities, and Threats (SWOT) analysis highlights AI’s potential to enhance PELA by increasing efficiency, adaptability, and personalization. AI could extend PELA’s scope to areas such as oral skills and dynamic assessment. However, academic integrity and data privacy remain critical concerns. The paper proposes a collaborative model integrating human expertise and AI in PELA, underscoring the irreplaceable value of human judgment. We also emphasize the need to establish clear guidelines for a human-centered AI approach within PELA, so as to maintain ethical standards and uphold assessment integrity.

Type
Research Article
Copyright
© The Author(s), 2025. Published by Cambridge University Press.
