Veterinary Journal, vol. 317, 2026 (SCI-Expanded, Scopus)
This study aimed to evaluate the diagnostic performance of large language models (LLMs) in canine oral assessment by measuring their accuracy and agreement against a panel of expert veterinary academicians. Sixty lateral oral photographs from clinical canine cases were evaluated for nine oral clinical signs by novice veterinarians, a panel of expert veterinary academicians, and four LLMs (ChatGPT-5.1, ChatGPT-5.1 Thinking, Claude-Sonnet 4.5, and Gemini-Pro 3). Expert consensus served as the reference standard. Intra-model consistency testing demonstrated substantial reliability for Claude-Sonnet 4.5 (90% full agreement; κ = 0.75), moderate reliability for Gemini-Pro 3 (70%; κ = 0.57) and ChatGPT-5.1 (50%; κ = 0.43), and weak reliability for ChatGPT-5.1 Thinking (30%; κ = 0.33). Overall diagnostic accuracy across evaluators ranged from 33.3% to 95%. Novice veterinarians achieved statistically significant agreement with the expert consensus for six of the nine clinical signs (e.g., calculus: κ = 0.577, p < 0.001; pigmentation: κ = 0.378, p < 0.001), outperforming all LLMs in diagnostic breadth. Among the LLMs, ChatGPT-5.1 and ChatGPT-5.1 Thinking reached significant weak-to-moderate agreement for several clinical signs (e.g., calculus: κ = 0.356–0.485, p ≤ 0.001). The LLMs outperformed the novices only for traumatic lesions (Gemini-Pro 3: κ = 0.290, p = 0.001). All evaluators performed at chance level for tooth fractures. In conclusion, novice veterinarians demonstrated higher diagnostic consistency and agreement with the experts than all evaluated LLMs. Although the ChatGPT-5.1 models showed the strongest AI performance, current LLMs remain insufficient for independent diagnostic use in veterinary dentistry.
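To illustrate the agreement statistic reported above, the following is a minimal sketch of computing Cohen's kappa between a reference standard and one evaluator, assuming per-photograph binary present/absent scores for a single clinical sign. The rating vectors are hypothetical and do not reproduce the study's data; scikit-learn's cohen_kappa_score is one common implementation, and the paper's actual statistical pipeline is not specified here.

# Minimal sketch: Cohen's kappa for rater agreement on one clinical sign.
# Ratings below are hypothetical (1 = sign present, 0 = absent).
from sklearn.metrics import cohen_kappa_score

expert_consensus = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # reference standard
evaluator        = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]  # e.g., novice or LLM output

kappa = cohen_kappa_score(expert_consensus, evaluator)
print(f"Cohen's kappa = {kappa:.3f}")
# For these vectors, observed agreement p_o = 0.80 and chance agreement
# p_e = 0.52, so kappa = (0.80 - 0.52) / (1 - 0.52) ≈ 0.583.

Because kappa corrects for chance agreement, it is lower than raw percent agreement, which is why a model can show 90% full agreement alongside κ = 0.75 in the consistency results above.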