Multimodal AI Chatbot Presents Mixed Results in Ophthalmic Imaging Analysis

Published on: March 1, 2024

The newest version of ChatGPT accurately responded to most multiple-choice questions on ophthalmic cases but performed better on non-image–based questions.

According to a new cross-sectional study, the newest version of the artificial intelligence chatbot ChatGPT-4 accurately responded to approximately two-thirds of image-based multiple-choice questions in a publicly available dataset of ophthalmic cases.¹

However, the large language model (LLM) responded correctly more often on questions that did not rely on ophthalmic image interpretation (82%) than on image-based questions (65%). Stratified by specialty, the chatbot performed best in retina cases and worst in neuro-ophthalmology cases.

“As multimodal LLMs become increasingly widespread, it remains imperative to continuously stress their appropriate use in medicine and highlight concerns surrounding confidentiality and bioethics,” wrote the investigative team led by Rajeev H. Muni, MD, MSc, department of ophthalmology, St Michael’s Hospital Unity Health Toronto.

Recent evidence has indicated the potentially transformative nature of AI chatbots in medicine, particularly in ophthalmology, to ease the burden on healthcare professionals, from patient education to remote monitoring of eye diseases.² Much like any new technology, however, there is a need to address regulatory compliance, privacy, and integration of AI into healthcare systems before it is widely adopted.

Prior investigations from Muni and colleagues found a previous version of ChatGPT-4, limited to text-based prompts, improved its performance at an impressive rate in medical and ophthalmic settings.³ As ophthalmology relies on the interpretation of multimodal imaging to confirm diagnostic accuracy, the team noted this new ability of the chatbot to interpret ophthalmic images could be critical for reaching that next stage.¹

“The new release of the chatbot holds great potential in enhancing the efficiency of ophthalmic image interpretation, which may reduce the workload on clinicians, mitigate variability in interpretations and errors, and ultimately, lead to improved patient outcomes,” they wrote.

The cross-sectional analysis used publicly available data from the OCTCases medical education platform based at the investigators’ center in Canada. Cases on the platform are organized into 6 categories: retina, neuro-ophthalmology, uveitis, glaucoma, ocular oncology, and pediatric ophthalmology. All multiple-choice questions across all available ophthalmic cases on the platform were included in the analysis.

Muni and colleagues created a new ChatGPT Plus account to ensure there was no previous conversation history with the LLM before study initiation. The LLM account was granted multimodal capability by OpenAI, the chatbot’s parent organization, and all relevant cases and imaging were inputted from October 16 to October 23, 2023. The primary end point was chatbot accuracy in image recognition, measured as the proportion of correct responses.

Overall, the analysis consisted of 136 cases with 448 images on OCTCases. Among the questions attached to these cases, 429 were formatted as multiple-choice questions (82%) and were included in the statistical analysis. Across these cases, 125 were accompanied by optical coherence tomography (OCT) scans (92%) and 82 cases by fundus images (60%).

Upon analysis, Muni and colleagues found ChatGPT-4 answered 299 of the multiple-choice questions correctly across all ophthalmic cases (70%). The LLM’s performance was best on questions related to retina (77%) and worst in the neuro-ophthalmology category (58%) (difference, 18% [95% CI, 7.5–29.4]; P <.001).

It exhibited intermediate performance on questions from other ocular specialties, including the ocular oncology (72%), pediatric ophthalmology (68%), uveitis (67%), and glaucoma (61%) categories.

Across 303 multiple-choice questions requiring image interpretation, ChatGPT-4 answered 196 questions correctly (65%). Among 126 non-image–based questions, the score was higher, with 103 correct answers (82%). Overall, the chatbot performed better on non-image–based questions (difference, 17% [95% CI, 7.8–25.1]; P <.001), and the gap was widest in the pediatric ophthalmology category (difference, 47% [95% CI, 8.5–69.0]; P = .02).
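
For readers who want to check the arithmetic behind these figures, the short Python sketch below recomputes the two accuracy proportions and their difference from the counts reported above. The confidence interval here uses a simple Wald (normal approximation) formula, which is an assumption made for illustration: the article does not say which interval method the investigators used, so the bounds are close to, but not identical to, the published 7.8–25.1. The function names are hypothetical.

```python
from math import sqrt


def accuracy(correct: int, total: int) -> float:
    """Proportion of questions answered correctly."""
    return correct / total


def wald_difference_ci(c1: int, n1: int, c2: int, n2: int, z: float = 1.96):
    """Difference in two proportions (group 1 minus group 2) with a Wald interval.

    Assumption: a normal-approximation interval, chosen for simplicity.
    The study may have used a different method.
    """
    p1, p2 = c1 / n1, c2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se


# Counts as reported in the article
image_correct, image_total = 196, 303
text_correct, text_total = 103, 126

image_acc = accuracy(image_correct, image_total)
text_acc = accuracy(text_correct, text_total)
overall_acc = accuracy(image_correct + text_correct, image_total + text_total)

diff, lower, upper = wald_difference_ci(
    text_correct, text_total, image_correct, image_total
)

print(f"Image-based accuracy:     {image_acc:.1%}")
print(f"Non-image-based accuracy: {text_acc:.1%}")
print(f"Overall accuracy:         {overall_acc:.1%}")
print(f"Difference: {diff:.1%} (approximate 95% CI, {lower:.1%} to {upper:.1%})")
```

The two subgroups sum to the 299 correct answers out of 429 questions reported for the full analysis, so the counts are internally consistent.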

Muni and colleagues indicated future analyses should focus on the chatbot’s ability to interpret different ophthalmic imaging modalities, to learn when it becomes as accurate as specific machine learning systems in ophthalmology.

“As the chatbot’s accuracy increases with time, it may develop the potential to inform clinical decision-making in ophthalmology via real-time analysis of ophthalmic cases,” Muni and colleagues wrote.

References

1. Mihalache A, Huang RS, Popovic MM, et al. Accuracy of an Artificial Intelligence Chatbot’s Interpretation of Clinical Ophthalmic Images. JAMA Ophthalmol. Published online February 29, 2024. doi:10.1001/jamaophthalmol.2024.0017
2. Tan TF, Thirunavukarasu AJ, Jin L, et al. Artificial intelligence and digital health in global eye health: opportunities and challenges. Lancet Glob Health. 2023;11(9):e1432-e1443. doi:10.1016/S2214-109X(23)00323-6
3. Iapoce C. Artificial Intelligence Chatbot appears to improve on Ophthalmic Knowledge Assessment. HCP Live. July 18, 2023. Accessed March 1, 2024. https://www.hcplive.com/view/artificial-intelligence-chatbot-appears-improve-ophthalmic-knowledge-assessment