ChatGPT May Be Useful, Reliable Tool for Healthcare Professionals in IBD, Study Finds

Published on: December 4, 2023

Investigators ranked ChatGPT responses to healthcare professional-oriented questions about IBD as more useful and reliable than the model’s responses to patient-oriented questions.

ChatGPT may be a useful and reliable resource for healthcare professionals in the context of inflammatory bowel disease (IBD), according to findings from a recent study.

When asked a set of 20 patient- and healthcare professional-oriented questions about ulcerative colitis (UC) and Crohn disease (CD), the large language model exhibited the strongest reliability and usefulness in the context of disease classification, diagnosis, activity, poor prognostic indicators, and complications.¹

“Recently, numerous articles focusing on AI and [large language models] have been published in the field of gastroenterology. In these publications, the predominant focus has been on asking general health questions related to gastrointestinal diseases,” wrote investigators.¹ “In spite of the numerous articles related [to large language models] in the context of gastroenterology, we were unable to find studies that pertain specifically to IBD.”

A large language model developed by OpenAI, ChatGPT uses deep learning techniques to produce human-like responses to natural language inputs across a broad spectrum of prompts. Its use in healthcare and medical domains is still being explored and refined, but many experts believe it will be a promising tool for both patients and healthcare professionals. However, its current reliability and usefulness in the context of gastroenterology and IBD have not been comprehensively determined.²

To assess the reliability and usefulness of ChatGPT for patients and healthcare professionals in the context of IBD, Rasim Eren Cankurtaran, MD, of the department of gastroenterology at Ankara Etlik City Hospital in Turkey, and colleagues devised 20 questions related to UC and CD to input into the "prompt" section of the ChatGPT AI chatbot. A set of 10 questions was created for each disease, the first 5 of which were created to reflect the most frequently asked questions by patients for both diseases. The second 5 were directed toward healthcare professionals.¹

To create the patient-oriented questions, investigators conducted separate Google Trends searches to identify the top 5 most commonly searched keywords related to UC and CD. Trends in search terms were identified based on global results between 2004 and the present day in the health subcategory. Questions were devised based on these keywords, covering aspects including the nature of the disease, its causes, symptoms, treatment, and diet.¹

A committee of 4 gastroenterologists led by an expert gastroenterologist worked together to create 5 healthcare professional-oriented questions for both diseases. Questions were related to the classification, diagnosis, activity, poor prognostic indicators, and complications of each disease.¹

The questions devised for both diseases were entered into the "prompt" section of the ChatGPT AI chatbot. Different users rewrote each question in separate sessions to ensure responses were not influenced by the previous question or response. The answers given by ChatGPT-4 were obtained from the March 14th premium version. Each response was graded by 2 investigators on a scale of 1 to 7 for reliability and usefulness.¹

The interrater Cronbach's α values for both reliability (α = 0.931 for UC and α = 0.936 for CD) and usefulness (α = 0.986 for UC and α = 0.925 for CD) scores showed very strong agreement. The greatest reliability score, scored as a 7 by both raters, was for CD classification. The greatest usefulness score, scored as a 7 by one rater and a 6 by the other, was for UC poor prognostic factors and CD classification.¹
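For readers unfamiliar with the statistic, the following is a minimal sketch of how Cronbach's α can be computed when each rater is treated as an "item" and each ChatGPT response as a subject. The function name `cronbach_alpha` and the scores are hypothetical and purely illustrative; they are not the study's data, and the investigators' actual software and procedure are not described here.

```python
from statistics import variance


def cronbach_alpha(ratings_by_rater):
    """Cronbach's alpha for k raters who each scored the same n responses.

    ratings_by_rater: list of k lists, one list of scores per rater.
    Uses alpha = k/(k-1) * (1 - sum of per-rater variances / variance of totals).
    """
    k = len(ratings_by_rater)
    if k < 2:
        raise ValueError("need at least two raters")
    n = len(ratings_by_rater[0])
    if n < 2 or any(len(r) != n for r in ratings_by_rater):
        raise ValueError("each rater must score the same two or more responses")

    # Sample variance of each rater's scores, summed across raters.
    sum_item_vars = sum(variance(r) for r in ratings_by_rater)

    # Sample variance of the per-response total across raters.
    totals = [sum(scores) for scores in zip(*ratings_by_rater)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")

    return (k / (k - 1)) * (1 - sum_item_vars / total_var)


# Hypothetical 1-7 scores for 10 responses (illustrative only, not study data).
rater_a = [5, 4, 3, 6, 7, 4, 5, 3, 6, 5]
rater_b = [5, 4, 4, 6, 7, 3, 5, 3, 6, 4]

alpha = cronbach_alpha([rater_a, rater_b])
print(f"Cronbach's alpha: {alpha:.3f}")
```

Values close to 1 indicate that the raters' scores move together across responses, which is the sense in which the reported values reflect strong agreement.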

Both raters assigned the lowest reliability score to questions related to UC causes and diagnosis and CD treatment. Similarly, they gave the lowest usefulness score to questions about UC diet, treatment, and diagnosis and CD treatment.¹

Further analysis revealed the reliability scores of ChatGPT’s answers to the professional-oriented questions were significantly greater than those for the patient-oriented questions (P = .032). There was no significant difference between the groups for usefulness scores (P = .052 for professionals, P = .217 for patients).¹

Investigators classified the distribution of the reliability and usefulness scores into 4 groups based on disease and question source by averaging the mean scores from both raters. Results showed the greatest scores in both reliability (5.00; standard deviation [SD], 1.21) and usefulness (5.15; SD, 1.08) were obtained for professional-oriented questions. The lowest scores were for patient-oriented questions (4.00; SD, 1.07 for reliability and 4.35; SD, 1.18 for usefulness). Based on the disease, responses to CD questions were ranked as more reliable (4.70; SD, 1.26) and useful (4.75; SD, 1.06) than responses to UC questions (4.40; SD, 1.21 for reliability and 4.55; SD, 1.31 for usefulness).¹

“Despite its capacity for reliability and usefulness in the context of IBD, ChatGPT still has some limitations and deficiencies. The correction of ChatGPT's deficiencies and its enhancement by developers with more detailed and up-to-date information could make it a significant source of information for both patients and medical professionals,” investigators concluded.¹

References:

1. Cankurtaran R, Polat Y, Aydemir N, et al. Reliability and Usefulness of ChatGPT for Inflammatory Bowel Diseases: An Analysis for Patients and Healthcare Professionals. Cureus. 2023;15(10):e46736. doi:10.7759/cureus.46736
2. Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595. Published 2023 May 4. doi:10.3389/frai.2023.1169595