A comparative evaluation of the quality of responses provided by different large language model chatbots to frequently asked questions regarding nerve blocks


Tulgar S., AKSU C., Selvi O., Sultan P., Dogan A. T., YÖRÜKOĞLU H. U., et al.

BMC Anesthesiology, vol.26, no.1, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 26 Issue: 1
  • Publication Date: 2026
  • DOI: 10.1186/s12871-025-03596-9
  • Journal Name: BMC Anesthesiology
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, EMBASE, MEDLINE, Directory of Open Access Journals
  • Atatürk University Affiliated: Yes

Abstract

Study objective: Large language models (LLMs) are now used in all areas of life and have become an information source for people seeking healthcare guidance. Although ChatGPT is the best known, Claude, CoPilot, and GEMINI are among the other widely used LLMs. Some of these models have been evaluated on the quality of their responses to frequently asked questions (FAQs) about broad content areas such as anesthesia and to specific FAQs on obstetric analgesia. However, no study has yet examined questions related to nerve blocks. In this study, we evaluated the quality of the answers given by four LLMs to FAQs related to ‘nerve block’. Design: Prospective Delphi study and survey. Intervention: Ten FAQs were identified and presented to four LLMs. A Delphi study was conducted to develop an assessment tool, and a survey was then carried out in which evaluators, selected through a rigorous process, rated the LLM responses using the developed tool. Measurements: The quality of LLM responses was assessed by raters using the ARQuAT (Assessing Response Quality in AI Texts) tool, developed through Delphi rounds. Evaluation criteria included content criteria such as accuracy, comprehensiveness, safety, timeliness, and relevance, as well as communication criteria such as understandability, empathy, ethical considerations, readability, and neutrality. Main results: ChatGPT and Claude achieved higher ARQuAT-Overall scores than GEMINI and CoPilot (p < 0.001). ChatGPT and Claude reached satisfaction rates above 80% in both content and communication quality metrics, significantly outperforming GEMINI (p < 0.001 for both comparisons), while CoPilot showed intermediate performance. Conclusion: FAQs related to nerve blocks were addressed well and acceptably by ChatGPT, Claude, and, to a lesser extent, CoPilot. GEMINI performed poorly compared with the others, showing subpar performance on several questions, particularly in terms of safety and relevance.