Comparison of AI-generated renal diets by different large language models: a guideline-based evaluation with expert input

AYDIN ÇİL, Mevra; KARATAŞ, Neva; Mustafaoglu, Ozge; SEVİNÇ, Can

doi:10.1186/s12882-026-04764-w

AYDIN ÇİL M., KARATAŞ N., Mustafaoglu O., SEVİNÇ C.

BMC Nephrology, vol. 27, no. 1, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 27 Issue: 1
Publication Date: 2026
DOI: 10.1186/s12882-026-04764-w
Journal Name: BMC Nephrology
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, EMBASE, MEDLINE, Directory of Open Access Journals
Keywords: Artificial intelligence, Chronic kidney disease, Diet planning, Guideline compliance, Hemodialysis, Large language models, Nutrition therapy
Affiliated with Atatürk University: Yes

Abstract

Background: Chronic kidney disease (CKD) represents a major public health concern due to its increasing prevalence worldwide and the substantial burden it places on healthcare systems. Nutritional therapy plays a critical role in slowing the progression of the disease. Recently, large language models (LLMs) such as ChatGPT, Copilot, and Gemini have been used for dietary planning in patients with kidney disease. However, the accuracy, reliability, and guideline adherence of the diets generated by these models remain uncertain. This study aimed to evaluate AI-based dietary recommendations in light of clinical guidelines and expert opinions.

Methods: Three artificial intelligence models (ChatGPT-4o, Microsoft Copilot, and Google Gemini) were tested using standardized patient scenarios in both Turkish and English. The models were asked to generate dietary plans for hemodialysis patients, for patients with stage 3–5 CKD, and for a reference diet of 1800 kcal/40 g protein. The dietary outputs were analyzed using the BeBIS 9 software and evaluated with IBM SPSS Statistics 26.0. Energy, macronutrient, and micronutrient contents were compared against the TUBER 2022 (Türkiye Nutrition Guide 2022) and KDOQI (Kidney Disease Outcomes Quality Initiative) guidelines. In addition, five experts scored the models' recommendations on accuracy, comprehensiveness, reproducibility, innovation/personalization, and nutritional diversity/practicality.

Results: The energy values generated by all models remained below the reference targets. Energy content differed significantly across models and languages (lowest: Gemini-English, 859.11 ± 151.35 kcal; highest: Copilot-English, 1496.03 ± 249.71 kcal; p < 0.001). Potassium levels varied significantly by model (p < 0.001), with some outputs exceeding safe limits. In the expert evaluations of accuracy, Gemini-English achieved the highest score (median: 4.00), whereas ChatGPT-Turkish recorded the lowest (median: 0.00). While Gemini stood out in innovation, the Copilot-Turkish plans were most closely aligned with the guidelines in terms of sodium and phosphorus. None of the models fully met the guideline requirements.

Conclusions: Although large language models hold potential for dietary planning in patients with kidney disease, they demonstrate inconsistencies in nutrient accuracy, guideline adherence, and personalization. Their standalone use in clinical practice is therefore not appropriate; expert supervision and integration with clinical guidelines are required.

Trial registration: Not applicable.