Evaluation of large language models for clinical sign-based oral assessment in dogs compared with veterinary practitioners


KOCAMAN Y., Yanmaz L. E., OKUR S., TURGUT F., Suzak Kocaman I., ASLAN CANATAN V., et al.

Veterinary Journal, vol. 317, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 317
  • Publication Date: 2026
  • DOI: 10.1016/j.tvjl.2026.106611
  • Journal Name: Veterinary Journal
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, EMBASE, MEDLINE
  • Keywords: ChatGPT, Claude-Sonnet, Dentistry, Gemini, Gingivitis, Image-analysis
  • Atatürk University Affiliated: Yes

Abstract

This study aimed to evaluate the diagnostic performance of Large Language Models (LLMs) in canine oral assessment by comparing their accuracy and agreement with a panel of expert veterinary academicians. Sixty lateral oral photographs from clinical canine cases were evaluated for nine oral clinical signs by novice veterinarians, a panel of expert veterinary academicians, and four LLMs (ChatGPT-5.1, ChatGPT-5.1 Thinking, Claude-Sonnet 4.5, Gemini-Pro 3). Expert consensus served as the reference standard. Intra-model consistency testing demonstrated substantial reliability for Claude-Sonnet 4.5 (90% full agreement; κ = 0.75), moderate reliability for Gemini-Pro 3 (70%; κ = 0.57) and ChatGPT-5.1 (50%; κ = 0.43), and weak reliability for ChatGPT-5.1 Thinking (30%; κ = 0.33). Overall diagnostic accuracy across evaluators ranged from 33.3% to 95%. Novice veterinarians achieved statistically significant agreement for six clinical signs (e.g., calculus κ = 0.577, p < 0.001; pigmentation κ = 0.378, p < 0.001), outperforming all LLMs in diagnostic breadth. Among the LLMs, ChatGPT-5.1 and ChatGPT-5.1 Thinking reached significant weak-to-moderate agreement for several clinical signs (e.g., calculus κ = 0.356–0.485; p ≤ 0.001). LLMs outperformed novices only for traumatic lesions (Gemini-Pro 3 κ = 0.290; p = 0.001). All evaluators exhibited chance-level performance for tooth fractures. In conclusion, novice veterinarians demonstrated higher diagnostic consistency and agreement with experts than all evaluated LLMs. Although the ChatGPT-5.1 models showed the strongest AI performance, current LLMs remain insufficient for independent diagnostic use in veterinary dentistry.
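As a minimal illustration of the agreement statistic reported throughout the abstract, the sketch below computes Cohen's kappa from two raters' present/absent calls for a single clinical sign. The labels are hypothetical example data, not values from the study; the function name and inputs are illustrative only.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (nominal labels)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b.get(label, 0) for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 1 = sign present, 0 = absent, on ten photographs.
expert = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
model  = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
print(round(cohens_kappa(expert, model), 3))  # → 0.583
```

Kappa corrects raw percent agreement for agreement expected by chance, which is why the abstract reports both the full-agreement percentage and κ: two raters can match often yet still show weak κ when one label dominates.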