Is generative AI ready to replace human raters in scoring EFL writing? Comparison of human and automated essay evaluation



Topuz A. C., Yıldız M., Taşlıbeyaz E., Polat H., Kurşun E.

EDUCATIONAL TECHNOLOGY & SOCIETY, vol. 28, no. 3, pp. 36-50, 2025 (SSCI, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 28 Issue: 3
  • Publication Date: 2025
  • DOI: 10.30191/ets.202507_28(3).sp04
  • Journal Name: EDUCATIONAL TECHNOLOGY & SOCIETY
  • Journal Indexes: Social Sciences Citation Index (SSCI), Scopus, Academic Search Premier, EBSCO Education Source, Education Abstracts, Educational Research Abstracts (ERA), ERIC (Education Resources Information Center), INSPEC, PsycINFO, Directory of Open Access Journals
  • Page Numbers: pp. 36-50
  • Open Archive Collection: AVESİS Open Access Collection
  • Atatürk University Affiliated: Yes

Abstract

Language teachers spend considerable time scoring students' writing and may struggle to provide reliable scores because essay scoring is time-consuming. AI-based Automated Essay Scoring (AES) systems have therefore been adopted, and Generative AI (GenAI) has recently emerged as a potential tool for scoring essays. This study therefore examines the differences and relationships between human rater (HR) and GenAI scores for essays produced by English as a Foreign Language (EFL) learners. The data consisted of 210 essays produced by 35 undergraduate students. Two HRs and GenAI evaluated the essays using an analytical rubric comprising five factors: (1) ideas, (2) organization and coherence, (3) support, (4) style, and (5) mechanics. The study found significant differences between the scores given by the HRs and those generated by GenAI, as well as variation between the HRs themselves; GenAI's scores, however, were consistent across its two evaluation rounds. GenAI's scores were also statistically significantly lower than those of the HRs. In addition, the two HRs' scores correlated only weakly with each other, whereas GenAI's scores across rounds correlated strongly. HR-1's scores correlated significantly with GenAI's across all five factors, whereas HR-2's correlated significantly with GenAI's in only three. These results can guide EFL teachers in reducing their writing-assessment workload by delegating part of the essay-scoring process to GenAI. Based on the findings and limitations, the study also offers suggestions for future research on AES.
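
The kind of rater comparison the abstract describes can be illustrated with a short sketch. This is not the authors' analysis code: the specific tests (a paired Wilcoxon signed-rank test for score differences and a Spearman correlation for agreement) and the score values are assumptions chosen for illustration only; the paper may use different procedures.

```python
# Minimal sketch: comparing scores from one human rater (HR-1) and GenAI
# on the same set of essays. All score values below are hypothetical.
from scipy import stats

hr1_scores   = [18, 22, 15, 20, 17, 24, 19, 21, 16, 23]  # hypothetical HR-1 rubric totals
genai_scores = [16, 20, 14, 18, 15, 22, 17, 19, 15, 21]  # hypothetical GenAI rubric totals

# Paired difference test: do GenAI scores differ systematically from HR-1?
w_stat, w_p = stats.wilcoxon(hr1_scores, genai_scores)

# Rank correlation: do the two raters order the essays similarly?
rho, rho_p = stats.spearmanr(hr1_scores, genai_scores)

print(f"Wilcoxon signed-rank: W={w_stat:.2f}, p={w_p:.4f}")
print(f"Spearman correlation: rho={rho:.2f}, p={rho_p:.4f}")
```

In a full analysis along these lines, the same pair of checks would be repeated per rubric factor and for each rater pair (HR-1 vs. HR-2, each HR vs. GenAI, and GenAI's two rounds against each other).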