Artificial intelligence chatbots, including ChatGPT and Bard, varied in accuracy, comprehensiveness, and readability when answering glaucoma questions adapted from patient brochures.
A new evaluation measured the strengths and limitations of the responses of multiple artificial intelligence (AI) chatbots, including ChatGPT, Bing, and Bard, in answering glaucoma-related patient questions.1
Overall, AI chatbot response accuracy fell below that of the American Academy of Ophthalmology (AAO) glaucoma-related patient education brochures. However, the accuracy scores of Bing and ChatGPT trailed the brochures only slightly, suggesting their potential utility for glaucoma patient education.
“With an estimated 43% of glaucoma patients utilizing the Internet for medical information, it is essential for physicians to understand both the information AI chatbots provide and how it is provided, such that physicians are best equipped to guide patient education and improve patient adherence,” wrote the investigative team, led by Natasha N. Kolomeyer, MD, Glaucoma Research Center, Wills Eye Hospital.
Visual outcomes of glaucoma are contingent on adherence to treatment regimens, a factor often influenced by patient education.2 Eye care specialists are typically a source of education, but nearly half of patients with glaucoma have turned to the Internet for medical information. The quality and potential harm of online medical resources vary widely, and patients can become inundated with information.3
AI-based chatbots represent a new avenue for a direct and interactive form of acquiring medical information online.4 However, as AI-based chatbots could serve as patients' primary source of medical information, it is essential to characterize the information they provide to allow providers to modify discussions, anticipate patient concerns, and identify misleading information.1
The study, led by Kolomeyer, compared glaucoma-related responses across available chatbots, including ChatGPT-4 by OpenAI, Bard by Google, and Bing by Microsoft, measuring response accuracy, comprehensiveness, readability, word count, and character count against AAO patient materials. Section headers from the most recent AAO glaucoma-related patient education brochures were converted into question form suitable for input into the AI chatbots.
In April 2023, 19 questions were entered into 3 AI chatbots 5 times each to generate 5 unique glaucoma response sets. A team of 3 independent glaucoma fellowship-trained ophthalmologists assessed the accuracy and comprehensiveness of the AAO brochure information and AI chatbot responses on a 1–5 scale.
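By this design, each chatbot produced 95 responses for grading (19 questions × 5 repetitions), for a total of 285 chatbot responses across the 3 platforms.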
Upon analysis, the accuracy scores for AAO, ChatGPT, Bing, and Bard were identified as 4.84, 4.26, 4.53, and 3.52, respectively. On direct comparison, AAO was found to be more accurate than ChatGPT (P = .002) and Bard (P <.001), ChatGPT was more accurate than Bard (P = .002), and Bing was more accurate than Bard (P = .001).
Moreover, ChatGPT demonstrated the most comprehensive responses of the chatbots (versus Bing, P <.001; versus Bard, P = .008). The comprehensiveness scores for ChatGPT, Bing, and Bard were 3.32, 2.16, and 2.79, respectively. Responses from the Bing chatbot demonstrated the lowest word and character counts (all P <.0001).
Regarding readability, assessed using the Flesch-Kincaid Grade Level, the AAO brochure information and Bard responses were written at the most accessible reading levels (versus all other chatbots, P <.0001). The grade levels for AAO, ChatGPT, Bing, and Bard were 8.11, 13.01, 11.73, and 7.90, respectively.
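For context, the Flesch-Kincaid Grade Level is a standard readability metric; its usual formulation (not specific to this study's implementation) estimates the US school grade needed to understand a text:

Grade Level = 0.39 × (total words ÷ total sentences) + 11.8 × (total syllables ÷ total words) − 15.59

By this scale, the AAO brochures and Bard responses sit at roughly an eighth-grade level, far closer to the sixth-grade reading level commonly recommended for patient education materials than ChatGPT's college-level output.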
Although the accuracy of responses is crucial for a patient education resource, Kolomeyer and colleagues cautioned that insufficiently comprehensive responses fail to equip patients with relevant, necessary disease-specific information. Instead, such responses could provide a false assurance that leads patients to opt out of speaking with their physician.
In particular, the team noted that ChatGPT's comprehensiveness score suggested superiority over the AAO brochures, even though the brochures include critical concepts that patients may not think to input into chatbots on their own. Because AI chatbots answer only what is asked, Kolomeyer and colleagues noted the potential for patients to ask fewer questions and, ultimately, obtain less disease-specific information.
“AI developers can improve glaucoma-related chatbot responses by improving readability and reducing inaccuracies with the use of more accurate online sources and glaucoma specialists,” they wrote. “With improvements, AI chatbots may be a useful supplementary source of glaucoma information to enhance patient education in the future.”
References