Assessing the accuracy, reliability, quality, and readability of artificial intelligence chatbots in patient education: insights from zirconia crowns


Acar N. K., Şengül F., Bardakci E., Çelikel P.

BMC ORAL HEALTH, vol. 26, no. 1, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 26 Issue: 1
  • Publication Date: 2026
  • DOI: 10.1186/s12903-026-07884-9
  • Journal Name: BMC ORAL HEALTH
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, MEDLINE, Directory of Open Access Journals
  • Atatürk University Affiliated: Yes

Abstract

Background: This study comparatively evaluates the accuracy, reliability, quality, and readability of responses generated by five widely used artificial intelligence (AI) chatbots when addressing frequently asked questions (FAQs) about zirconia and pediatric zirconia crowns.

Methods: Twenty FAQs on zirconia crowns were derived from two Google searches ("frequently asked questions about zirconia" and "frequently asked questions about pediatric zirconia"). Five chatbots (ChatGPT-5, ChatGPT-4o, Gemini-2.5 Flash, DeepSeek-V3, and Microsoft Copilot) were queried independently, and their responses were anonymized and evaluated. Accuracy was rated on a 5-point Likert scale, reliability with a modified DISCERN tool, quality with the Global Quality Scale (GQS), and readability with the Flesch Reading Ease Score (FRES). Statistical analyses included Mann-Whitney U and Kruskal-Wallis tests, with intraclass correlation coefficients (ICC) used to assess inter-rater reliability.

Results: Inter-rater agreement was strong (ICC: 0.78-0.98). Gemini achieved the highest scores for accuracy, quality, and reliability (p < 0.001), while ChatGPT-4o, ChatGPT-5, and DeepSeek produced the most readable responses. Microsoft Copilot scored lowest across all domains, particularly in reliability and readability. No significant differences emerged between the prosthodontic and pediatric evaluations, except for higher GQS ratings for DeepSeek in pediatric dentistry (p = 0.035).

Conclusion: Gemini showed the highest accuracy, reliability, and quality, indicating strong potential for clinician use in generating evidence-aligned information, while ChatGPT-4o, ChatGPT-5, and DeepSeek offered more readable outputs better suited to patient-facing explanations. Given the substantial between-platform variability, clinicians should critically appraise and, where necessary, adapt chatbot responses to ensure alignment with current evidence before recommending them to patients.
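To make the analysis pipeline in the Methods concrete, the sketch below illustrates the same class of tests in Python with SciPy: a Kruskal-Wallis omnibus comparison across the five chatbots, pairwise Mann-Whitney U follow-ups, and the standard Flesch Reading Ease formula. This is a minimal illustration, not the authors' code; the Likert ratings and word/sentence/syllable counts are hypothetical placeholders, not data from the study.

```python
from scipy.stats import kruskal, mannwhitneyu

# Hypothetical 5-point Likert accuracy ratings, one list per chatbot.
# These numbers are illustrative only and do not come from the study.
ratings = {
    "ChatGPT-5":   [4, 5, 4, 4, 5, 4, 3, 4],
    "ChatGPT-4o":  [4, 4, 5, 4, 4, 3, 4, 4],
    "Gemini-2.5":  [5, 5, 4, 5, 5, 4, 5, 5],
    "DeepSeek-V3": [4, 3, 4, 4, 4, 4, 3, 4],
    "Copilot":     [3, 3, 2, 3, 3, 2, 3, 3],
}

# Omnibus test: do median ratings differ across the five platforms?
h_stat, p_value = kruskal(*ratings.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}")

# Pairwise Mann-Whitney U comparisons between platforms.
names = list(ratings)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        u, p = mannwhitneyu(ratings[names[i]], ratings[names[j]],
                            alternative="two-sided")
        print(f"{names[i]} vs {names[j]}: U = {u:.1f}, p = {p:.4f}")

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score from raw counts (higher = easier to read)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Example: a 120-word, 8-sentence response with 180 syllables scores ~64.7.
print(f"FRES = {flesch_reading_ease(120, 8, 180):.1f}")
```

Nonparametric tests are the natural choice here because Likert-scale and DISCERN/GQS ratings are ordinal; the FRES formula itself is standard, though in practice syllable counting is usually delegated to a library rather than done by hand.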