Comparative assessment of AI-generated responses to frequently asked questions by parents on space maintainers in pediatric dentistry.


Karadeniz H. B., Celikel P., Sengul F.

BMC Oral Health, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume:
  • Publication Date: 2026
  • DOI: 10.1186/s12903-026-07751-7
  • Journal Name: BMC Oral Health
  • Indexed In: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, MEDLINE, Directory of Open Access Journals
  • Atatürk University Affiliated: Yes

Abstract

Background

Artificial intelligence (AI)–based chatbots are increasingly being used in healthcare for information access and patient education. In pediatric dentistry, however, evidence regarding the accuracy, reliability, and clinical validity of these systems is limited. This study aimed to evaluate the responses of ChatGPT-3.5, ChatGPT-5, Google Gemini 2.5 Flash, DeepSeek-V3, and Grok 3 to questions about space maintainers in terms of accuracy, reliability, quality, and readability.


Methods

Each of the five chatbots was asked the 20 questions most frequently asked by parents about space maintainers. Responses were evaluated by two pediatric dentistry specialists for accuracy using a 5-point Likert scale, reliability using the Quality Criteria for Consumer Health Information (DISCERN) instrument, quality using the Global Quality Scale (GQS), and readability using the Flesch Reading Ease Score (FRES). The Intraclass Correlation Coefficient (ICC) was used to assess agreement between the two specialists. All statistical analyses were performed using SPSS version 26. The Kruskal–Wallis test was applied to the Likert, GQS, and DISCERN scores, while one-way ANOVA with post-hoc tests was applied to the FRES scores. Statistical significance was set at p < 0.05.
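As a rough illustration of the analysis described above, the following Python sketch shows how FRES values might be computed from the standard Flesch formula and how the group comparisons could be run with SciPy rather than SPSS. The chatbot names are taken from the study, but all numerical scores below are placeholders for illustration only, not data from the paper.

```python
# Hypothetical sketch of the readability scoring and group comparisons.
# All score values are placeholders, not data from the study.
from scipy import stats


def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Standard Flesch Reading Ease formula (higher = easier to read)."""
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))


# Placeholder per-question GQS scores for three of the five chatbots.
gqs_scores = {
    "ChatGPT-3.5": [4, 4, 5, 3, 4],
    "Google Gemini 2.5 Flash": [3, 4, 3, 3, 4],
    "Grok 3": [4, 3, 3, 4, 3],
}

# Ordinal scales (Likert, GQS, DISCERN): non-parametric Kruskal-Wallis test.
h_stat, p_kw = stats.kruskal(*gqs_scores.values())
print(f"Kruskal-Wallis (GQS): H = {h_stat:.2f}, p = {p_kw:.3f}")

# Placeholder per-question FRES values (continuous): one-way ANOVA.
fres_scores = {
    "ChatGPT-3.5": [62.1, 58.4, 65.0, 60.2],
    "Google Gemini 2.5 Flash": [45.3, 48.9, 44.1, 50.0],
    "Grok 3": [47.8, 46.2, 49.5, 44.7],
}
f_stat, p_anova = stats.f_oneway(*fres_scores.values())
print(f"One-way ANOVA (FRES): F = {f_stat:.2f}, p = {p_anova:.3f}")
```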


Results

No statistically significant difference was found in Likert scores among ChatGPT-3.5, ChatGPT-5, Google Gemini 2.5 Flash, DeepSeek-V3, and Grok 3 (p > 0.05). In contrast, statistically significant differences were observed in DISCERN and GQS scores (p < 0.05). In terms of FRES scores, ChatGPT-3.5, ChatGPT-5, and DeepSeek-V3 demonstrated higher readability, whereas Google Gemini 2.5 Flash and Grok 3 obtained lower scores.


Conclusion

Although AI-based chatbots showed comparable performance in terms of accuracy, they differed significantly in reliability, quality, and readability. These findings suggest that AI chatbots have the potential to provide parents with accurate and understandable information; however, model-specific differences in reliability and readability should be considered.