Is generative AI ready to replace human raters in scoring EFL writing? Comparison of human and automated essay evaluation



Topuz A. C., Yıldız M., Taşlıbeyaz E., Polat H., Kurşun E.

EDUCATIONAL TECHNOLOGY & SOCIETY, vol. 28, no. 3, pp. 36-50, 2025 (SSCI, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 28 Issue: 3
  • Publication Date: 2025
  • DOI: 10.30191/ets.202507_28(3).sp04
  • Journal Name: EDUCATIONAL TECHNOLOGY & SOCIETY
  • Journal Indexes: Social Sciences Citation Index (SSCI), Scopus, Academic Search Premier, EBSCO Education Source, Education Abstracts, Educational Research Abstracts (ERA), ERIC (Education Resources Information Center), INSPEC, PsycINFO, Directory of Open Access Journals
  • Page Numbers: pp. 36-50
  • Open Archive Collection: AVESİS Open Access Collection
  • Atatürk University Affiliated: Yes

Abstract

Language teachers spend considerable time scoring students' writing and may struggle to provide reliable scores because essay scoring is time-consuming. AI-based Automated Essay Scoring (AES) systems have therefore been adopted, and Generative AI (GenAI) has recently emerged as a potential tool for scoring essays. This study therefore examines the differences and relationships between human rater (HR) and GenAI scores for essays produced by English as a Foreign Language (EFL) learners. The data consisted of 210 essays produced by 35 undergraduate students. Two HRs and GenAI evaluated the essays using an analytical rubric comprising five factors: (1) ideas, (2) organization and coherence, (3) support, (4) style, and (5) mechanics. The study found significant differences between the scores given by the HRs and those generated by GenAI, as well as variation between the HRs themselves; GenAI's scores, however, were consistent across its two evaluation rounds. GenAI's scores were also statistically significantly lower than those of the HRs. In addition, the two HRs' scores correlated only weakly with each other, whereas GenAI's scores across rounds correlated strongly. HR-1's scores correlated significantly with GenAI's across all five factors, whereas HR-2's correlated significantly with GenAI's in only three. These results can guide EFL teachers in reducing their writing-assessment workload by delegating part of the essay-scoring process to GenAI. Based on the findings and limitations, the study also offers suggestions for future research on AES.
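
The kind of rater comparison the abstract describes can be illustrated with a short sketch. This is not the authors' analysis code: the specific tests (a paired Wilcoxon signed-rank test for score differences and a Spearman correlation for agreement) and the score values are assumptions chosen for illustration only; the paper may use different procedures.

```python
# Minimal sketch: comparing scores from one human rater (HR-1) and GenAI
# on the same set of essays. All score values below are hypothetical.
from scipy import stats

hr1_scores   = [18, 22, 15, 20, 17, 24, 19, 21, 16, 23]  # hypothetical HR-1 rubric totals
genai_scores = [16, 20, 14, 18, 15, 22, 17, 19, 15, 21]  # hypothetical GenAI rubric totals

# Paired difference test: do GenAI scores differ systematically from HR-1?
w_stat, w_p = stats.wilcoxon(hr1_scores, genai_scores)

# Rank correlation: do the two raters order the essays similarly?
rho, rho_p = stats.spearmanr(hr1_scores, genai_scores)

print(f"Wilcoxon signed-rank: W={w_stat:.2f}, p={w_p:.4f}")
print(f"Spearman correlation: rho={rho:.2f}, p={rho_p:.4f}")
```

In a full analysis along these lines, the same pair of checks would be repeated per rubric factor and for each rater pair (HR-1 vs. HR-2, each HR vs. GenAI, and GenAI's two rounds against each other).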