
Study findings suggest both ChatGPT and Gemini require further model improvements before LLMs can be relied on solely in clinical nutrition practice.
Findings from a recent study suggest large language models (LLMs) like ChatGPT and Gemini are not sufficient tools for providing dietary recommendations for patients with irritable bowel syndrome (IBS).1
While ChatGPT's responses were mostly compliant with current dietary guidelines for IBS, the results underscore the need for further model improvements before relying solely on LLMs in clinical nutrition practice, highlighting the continued importance of dietitians' recommendations and of collaboration between AI models and healthcare teams.1
Despite recent advancements in public awareness and treatment options, findings from the American Gastroenterological Association’s 2025 IBS in America survey shed light on persistent challenges faced by patients with IBS as well as shifts in patient experiences, health care provider perceptions, and the treatment landscape. Artificial intelligence tools have shown promise for other healthcare-related uses like antimicrobial stewardship in hospital settings, but their utility for IBS remains underexplored.2,3
“While some studies show that the nutrition and health advice provided by LLMs are similar to that of healthcare professionals, others show that they provide inaccurate and nonevidence-based advice,” Merve Kip, a research assistant in the department of nutrition and dietetics at Nuh Naci Yazgan University in Turkey, and colleagues wrote.1 “To our knowledge, there are no studies evaluating the dietary recommendations for IBS given by LLMs.”
To address this gap in research, investigators conducted a cross-sectional comparative study assessing nutrition recommendations for IBS generated by ChatGPT-4o mini and Gemini 1.5 regarding guideline compliance, quality, understandability, actionability, and readability. They additionally assessed the correlation between quality, guideline compliance, and patient education tool scores of LLMs.1
Investigators created a set of 10 frequently asked questions based on the most frequently cited and commonly used guidelines, as well as on their clinical experience with concerns commonly raised by patients with IBS. These questions represented typical patient issues, including which diet to follow for IBS, foods with positive or negative effects, coffee and tea consumption, ingredients to avoid in packaged foods, allowable fruit intake, appropriate supplements, and what to eat or avoid in cases of constipation, diarrhea, or bloating.1
The free versions of LLMs were used in the study because they were deemed to be the most accessible to patients. Before starting the study, investigators cleared the browser history and all chat history. A new chat window was then opened for each question to avoid any possible context transfer, and each question was asked 4 times every 15 minutes.1
Investigators created the Guideline Compliance Score using the most frequently cited and commonly used IBS guidelines. They also used the Global Quality Score (GQS) and the Completeness, Lack of Misinformation, Evidence, Appropriateness, Relevance (CLEAR) tool to analyze the quality of the LLMs' responses. All responses were evaluated for understandability and actionability using the Patient Education Materials Assessment Tool (PEMAT), and ease of reading was assessed using the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) scores.1
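For readers unfamiliar with the two readability metrics, both are simple formulas over word, sentence, and syllable counts. The sketch below uses the standard published Flesch formulas; the example counts are illustrative and are not taken from the study's data.

```python
# Standard Flesch readability formulas (illustrative sketch; example
# counts below are hypothetical, not drawn from the study).

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher = easier; 30-50 is 'difficult to read'."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade required."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Example: a 100-word passage in 5 sentences with 170 syllables
fre = flesch_reading_ease(100, 5, 170)    # falls in the 30-50 band
fkgl = flesch_kincaid_grade(100, 5, 170)
```

On this scale, the mean FRE scores the study reports for both models land in the 30-50 "difficult to read" band.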
Results showed most answers given by ChatGPT (70%) and Gemini (57.5%) were compliant with the guidelines. ChatGPT provided "compliant" (70%) or "partially compliant" (30%) responses, while Gemini provided "compliant" (57.5%), "partially compliant" (20%), and "noncompliant" (22.5%) responses to the questions.1
Further analysis revealed ChatGPT had higher mean scores than Gemini for GQS, Guideline Compliance, and PEMAT understandability and actionability, but without statistical significance (P > .05). Of all assessed measures, Gemini scored significantly higher than ChatGPT only on the Evidence subscore (P < .001).1
Investigators noted the mean FRE score of Gemini was higher (49.68 ± 4.82) than that of ChatGPT (46.06 ± 4.52), which corresponded to "difficult to read" for both LLMs (P > .05). Additionally, both ChatGPT and Gemini were classified as "very good content" based on their CLEAR scores (19.29 ± 1.59 and 19.59 ± 4.72, respectively).1
GQS and CLEAR scores showed a strong positive correlation (r = 0.611; P = .004). The CLEAR score showed a moderate positive correlation with both PEMAT actionability (r = 0.467; P = .038) and understandability (r = 0.568; P = .009). The readability scores, FRE and FKGL, had a strong negative correlation (r = −0.784; P < .001), while the Guideline Compliance Score showed a moderate negative correlation with FRE (r = −0.537; P = .015).1
“Our study suggests that patients should be aware of the risk of relying on LLMs alone in practice, as they may provide inaccurate or incomplete information,” investigators concluded.1 “This study highlights the further need for model improvements or the development of new LLMs before their use in clinical nutrition practice. With future model refinement and development, AI technology can be used as an assistant tool that is always available to patients rather than replacing the advice of dietitians.”
References