Poor Diagnostic Accuracy, Limited Diversity Observed in AI-Generated Dermatologic Images

This study evaluated two key AI outputs: skin tone representation and diagnostic accuracy in AI-generated images of dermatologic diseases.

New research suggests that notable deficiencies remain present in the diversity and accuracy of artificial intelligence (AI)-generated images in dermatology.1

The study was conducted by Lucie Joerg, an MD candidate at Albany Medical College in New York, together with a team of investigators. Joerg and colleagues noted the urgent need to address these AI-related deficiencies in dermatology, highlighting the precision of virtual image rendering as vital given the growing inclination of some patients to self-diagnose.2

“Given the rapid adoption of AI technologies in dermatology and the harm of homogeneous, erroneous outputs, this study addresses this knowledge gap by assessing how well popular text-to-image AI programs reflect skin colour diversity and accurately depict skin conditions,” Joerg et al wrote.1

Trial Design and Findings

The investigators' cross-sectional study assessed both the representation of various skin tones and the diagnostic accuracy of AI-generated images of dermatologic conditions. They did so using the prompt "Generate a photo of a person with [skin condition]."

The team used 4 generative AI models—ChatGPT-4o, Adobe Firefly, Midjourney, and Stable Diffusion—to depict the 20 most prevalent dermatologic conditions. These prompts yielded a total of 4000 images, all generated between June and July 2024. Two independent raters evaluated the images for skin tone diversity using the Fitzpatrick scale.

The raters compared the observed skin tone distributions against US Census data using chi-square (χ²) analysis. A randomized subset of 200 images was also reviewed by 2 blinded dermatology residents to determine diagnostic accuracy, and inter-rater agreement was measured with the kappa statistic.
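The chi-square comparison described above can be sketched as a goodness-of-fit test of observed skin tone counts against census-derived expected counts. The counts and the 40% census benchmark below are illustrative assumptions for demonstration only, not figures reported in the study.

```python
# Hypothetical sketch of a chi-square goodness-of-fit test (df = 1)
# comparing darker- vs lighter-skin image counts from one model
# against a census-based expected proportion. All numbers here are
# assumed for illustration, not taken from the study.
from scipy.stats import chisquare


def compare_to_census(n_darker, n_total, census_darker_prop):
    """Test observed skin tone counts against census-derived expectations."""
    observed = [n_darker, n_total - n_darker]
    expected = [n_total * census_darker_prop,
                n_total * (1 - census_darker_prop)]
    return chisquare(f_obs=observed, f_exp=expected)


# A model whose output roughly tracks the benchmark: non-significant result
stat, p = compare_to_census(381, 1000, 0.40)
print(f"close match: chi2={stat:.3f}, p={p:.3f}")  # p well above .05

# A model that heavily underrepresents darker tones: highly significant
stat, p = compare_to_census(39, 1000, 0.40)
print(f"underrepresented: chi2={stat:.1f}, p={p:.2e}")  # p < .001
```

With one degree of freedom, a small χ² statistic (and P value above .05) indicates no substantial deviation from the expected demographic proportions, which is the interpretation the investigators applied to the Adobe Firefly output.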

Joerg et al found that 89.8% of all images featured patients with lighter skin tones, while only 10.2% featured darker skin tones. Notably, only the Adobe Firefly model's output closely matched the distribution seen in US population demographics.

Specifically, this model represented darker skin tones in 38.1% of images, with a non-significant chi-square value (χ²(1) = 0.320, P = .572), suggesting no substantial deviation from the expected demographic proportions. In contrast, Midjourney, ChatGPT-4o, and Stable Diffusion significantly underrepresented individuals with darker skin tones (Fitzpatrick type >IV).

These models generated such images at rates of only 3.9%, 6.0%, and 8.7%, respectively, and all differences were statistically significant (P < .001). In the image accuracy evaluations, only 15% of all included images were correctly identified as depicting the intended dermatologic disease. Another notable finding was that Adobe Firefly had the lowest diagnostic accuracy, at just 0.94%.

This contrasted with ChatGPT-4o, Midjourney, and Stable Diffusion, which achieved higher but still inadequate diagnostic accuracy rates of 22%, 12.2%, and 22.5%, respectively. Overall, the investigators concluded that their research may provide a first-of-its-kind comparative assessment of dermatologic image generation by commonly implemented AI applications.

Although 3 of the 4 models were noted by Joerg et al as significantly lacking skin tone diversity in their outputs, Adobe Firefly did attain skin tone representation that the investigators determined was in line with population data. Nevertheless, all 4 AI models showed limited capability in accurately depicting specific skin conditions.

“While the potential of generative AI in medicine is undeniable, without prompt action to ensure inclusive and accurate datasets, these technologies risk failing the communities they aim to serve,” the investigators concluded.1 “As AI shapes the future of healthcare, there is a responsibility to uphold fairness and equitable representation. Only through deliberate and diverse design can AI fulfil its promise as a tool for universal health equity.”

References

  1. Joerg L, Kabakova M, Wang JY, et al. AI-generated dermatologic images show deficient skin tone diversity and poor diagnostic accuracy: An experimental study. J Eur Acad Dermatol Venereol. 2025 Jul 16. doi: 10.1111/jdv.20849. Epub ahead of print. PMID: 40668069.
  2. Shahsavar Y, Choudhury A. User intentions to use ChatGPT for self-diagnosis and health-related purposes: cross-sectional survey study. JMIR Hum Factors. 2023; 10:e47564.
