Assessing the accuracy, reliability, quality, and readability of artificial intelligence chatbots in patient education: insights from zirconia crowns


Acar N. K., Şengül F., Bardakci E., Çelikel P.

BMC ORAL HEALTH, vol. 26, no. 1, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 26 Issue: 1
  • Publication Date: 2026
  • DOI: 10.1186/s12903-026-07884-9
  • Journal Name: BMC ORAL HEALTH
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, MEDLINE, Directory of Open Access Journals
  • Atatürk University Affiliated: Yes

Abstract

Background: This study comparatively evaluates the accuracy, reliability, quality, and readability of responses generated by five widely used artificial intelligence (AI) chatbots when addressing frequently asked questions (FAQs) about zirconia and pediatric zirconia crowns.

Methods: Twenty FAQs on zirconia crowns were derived from two Google searches ("frequently asked questions about zirconia" and "frequently asked questions about pediatric zirconia"). Five chatbots (ChatGPT-5, ChatGPT-4o, Gemini-2.5 Flash, DeepSeek-V3, and Microsoft Copilot) were queried independently, and their responses were anonymized and evaluated. Accuracy was rated on a 5-point Likert scale, reliability with a modified DISCERN tool, quality with the Global Quality Scale (GQS), and readability with the Flesch Reading Ease Score (FRES). Statistical analyses included Mann-Whitney U and Kruskal-Wallis tests, with intraclass correlation coefficients (ICC) used to assess inter-rater reliability.

Results: Inter-rater agreement was strong (ICC: 0.78-0.98). Gemini achieved the highest scores for accuracy, quality, and reliability (p < 0.001), while ChatGPT-4o, ChatGPT-5, and DeepSeek produced the most readable responses. Microsoft Copilot scored lowest across all domains, particularly in reliability and readability. No significant differences emerged between the prosthodontic and pediatric evaluations, except for higher GQS ratings for DeepSeek in pediatric dentistry (p = 0.035).

Conclusion: Gemini showed the highest accuracy, reliability, and quality, indicating strong potential for clinician use in generating evidence-aligned information, while ChatGPT-4o, ChatGPT-5, and DeepSeek offered more readable outputs better suited to patient-facing explanations. Given the substantial between-platform variability, clinicians should critically appraise and, where necessary, adapt chatbot responses to ensure alignment with current evidence before recommending them to patients.
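To make the analysis pipeline in the Methods concrete, the sketch below illustrates the same class of tests in Python with SciPy: a Kruskal-Wallis omnibus comparison across the five chatbots, pairwise Mann-Whitney U follow-ups, and the standard Flesch Reading Ease formula. This is a minimal illustration, not the authors' code; the Likert ratings and word/sentence/syllable counts are hypothetical placeholders, not data from the study.

```python
from scipy.stats import kruskal, mannwhitneyu

# Hypothetical 5-point Likert accuracy ratings, one list per chatbot.
# These numbers are illustrative only and do not come from the study.
ratings = {
    "ChatGPT-5":   [4, 5, 4, 4, 5, 4, 3, 4],
    "ChatGPT-4o":  [4, 4, 5, 4, 4, 3, 4, 4],
    "Gemini-2.5":  [5, 5, 4, 5, 5, 4, 5, 5],
    "DeepSeek-V3": [4, 3, 4, 4, 4, 4, 3, 4],
    "Copilot":     [3, 3, 2, 3, 3, 2, 3, 3],
}

# Omnibus test: do median ratings differ across the five platforms?
h_stat, p_value = kruskal(*ratings.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}")

# Pairwise Mann-Whitney U comparisons between platforms.
names = list(ratings)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        u, p = mannwhitneyu(ratings[names[i]], ratings[names[j]],
                            alternative="two-sided")
        print(f"{names[i]} vs {names[j]}: U = {u:.1f}, p = {p:.4f}")

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score from raw counts (higher = easier to read)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Example: a 120-word, 8-sentence response with 180 syllables scores ~64.7.
print(f"FRES = {flesch_reading_ease(120, 8, 180):.1f}")
```

Nonparametric tests are the natural choice here because Likert-scale and DISCERN/GQS ratings are ordinal; the FRES formula itself is standard, though in practice syllable counting is usually delegated to a library rather than done by hand.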