Veterinary Journal, vol. 317, 2026 (SCI-Expanded, Scopus)
This study aimed to evaluate the diagnostic performance of large language models (LLMs) in canine oral assessment by measuring their accuracy and agreement against a panel of expert veterinary academicians. Sixty lateral oral photographs from clinical canine cases were evaluated for nine oral clinical signs by novice veterinarians, a panel of expert veterinary academicians, and four LLMs (ChatGPT-5.1, ChatGPT-5.1 Thinking, Claude-Sonnet 4.5, and Gemini-Pro 3). Expert consensus served as the reference standard. Intra-model consistency testing demonstrated substantial reliability for Claude-Sonnet 4.5 (90% full agreement; κ = 0.75), moderate reliability for Gemini-Pro 3 (70%; κ = 0.57) and ChatGPT-5.1 (50%; κ = 0.43), and weak reliability for ChatGPT-5.1 Thinking (30%; κ = 0.33). Overall diagnostic accuracy across evaluators ranged from 33.3% to 95%. Novice veterinarians achieved statistically significant agreement with the expert consensus for six of the nine clinical signs (e.g., calculus: κ = 0.577, p < 0.001; pigmentation: κ = 0.378, p < 0.001), outperforming all LLMs in diagnostic breadth. Among the LLMs, ChatGPT-5.1 and ChatGPT-5.1 Thinking reached significant weak-to-moderate agreement for several clinical signs (e.g., calculus: κ = 0.356–0.485, p ≤ 0.001). The LLMs outperformed the novices only for traumatic lesions (Gemini-Pro 3: κ = 0.290, p = 0.001). All evaluators performed at chance level for tooth fractures. In conclusion, novice veterinarians demonstrated higher diagnostic consistency and agreement with the experts than all evaluated LLMs. Although the ChatGPT-5.1 models showed the strongest AI performance, current LLMs remain insufficient for independent diagnostic use in veterinary dentistry.
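To illustrate the agreement statistic reported above, the following is a minimal sketch of computing Cohen's kappa between a reference standard and one evaluator, assuming per-photograph binary present/absent scores for a single clinical sign. The rating vectors are hypothetical and do not reproduce the study's data; scikit-learn's cohen_kappa_score is one common implementation, and the paper's actual statistical pipeline is not specified here.

# Minimal sketch: Cohen's kappa for rater agreement on one clinical sign.
# Ratings below are hypothetical (1 = sign present, 0 = absent).
from sklearn.metrics import cohen_kappa_score

expert_consensus = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # reference standard
evaluator        = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]  # e.g., novice or LLM output

kappa = cohen_kappa_score(expert_consensus, evaluator)
print(f"Cohen's kappa = {kappa:.3f}")
# For these vectors, observed agreement p_o = 0.80 and chance agreement
# p_e = 0.52, so kappa = (0.80 - 0.52) / (1 - 0.52) ≈ 0.583.

Because kappa corrects for chance agreement, it is lower than raw percent agreement, which is why a model can show 90% full agreement alongside κ = 0.75 in the consistency results above.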