Assessing the Efficiency of GPT-4 in Simplifying and Summarizing Radiology Reports: A Quantitative and Qualitative Analysis

Poster #: 32
Session/Time: A
Author: Amir Hasani
Mentor: Ashkan Malayeri, MD
Co-Investigator(s): Mahshid Golagha - Radiology Department, Clinical Center, National Institutes of Health; Kush Attal - National Library of Medicine, National Institutes of Health; Mahshid Goljamali - School of Pharmacy, Virginia Commonwealth University; Brian Ondov - National Library of Medicine, National Institutes of Health; Aryan Zahergivar - Radiology Department, Clinical Center, National Institutes of Health; Mark Ball - Urology Department, Clinical Center, National Institutes of Health
Research Type: Clinical Research

Abstract

Introduction: The integration of AI and large language models (LLMs) like GPT-4 in medicine aims to simplify complex medical information, making it more accessible for patients and healthcare providers. This study evaluates GPT-4's ability to summarize and simplify radiology reports, focusing on the quality, readability, and clinical usefulness of AI-generated reports compared to those produced by radiologists.

Methods: A total of 150 anonymized abdominal radiology reports were processed through GPT-4 to generate two types of outputs: summarized versions and simplified versions. The original reports served as controls. Evaluations were conducted by six urology fellows and six radiology postdoctoral fellows, all blinded to the study's objectives. Automated metrics (Flesch-Kincaid Grade Level, Gunning Fog Index, BLEU, and ROUGE) were used alongside human assessments of clarity, language, and usefulness.
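For context on the readability metric, the Flesch-Kincaid Grade Level combines average sentence length with average syllables per word. The following is a minimal illustrative sketch (not the study's evaluation code) using a crude vowel-group syllable heuristic; published implementations use more refined syllable counting:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: each run of consecutive vowels counts as one syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    # Flesch-Kincaid Grade Level:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Lower scores indicate text readable at a lower school grade level, which is why the simplified and summarized outputs score below the original reports.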

Results: The original reports averaged 120.98 words and 576.4 characters, while GPT-4 summarized reports averaged 24.77 words and 128.75 characters, and simplified reports averaged 22.12 words and 94.14 characters. Readability was significantly better for GPT-4 outputs, with Flesch-Kincaid Grade Levels of 0.95 (summarized) and 1.00 (simplified) versus 3.54 for the original reports (lower scores indicate easier reading). Low BLEU and ROUGE scores indicated limited word-level overlap between AI-generated and original reports, consistent with substantial rewording rather than verbatim extraction. Human evaluators rated the summarized reports higher than the simplified versions in clarity (3.98 vs. 3.62) and overall preference (3.92 vs. 3.80).
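To illustrate what the overlap metrics measure (again, a sketch rather than the authors' evaluation pipeline), ROUGE-1 recall is the fraction of reference unigrams that also appear in the candidate text, with counts clipped to the reference:

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    # ROUGE-1 recall: clipped unigram overlap divided by the
    # number of unigrams in the reference text.
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())
```

A heavily reworded simplification shares few unigrams with the original report, so it scores low on this metric even when it preserves the clinical meaning, which is why low BLEU/ROUGE here signals paraphrasing rather than poor quality.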

Conclusion: GPT-4 effectively simplifies and summarizes radiology reports, improving readability and clarity, which may enhance patient understanding. However, further refinement is necessary to ensure that these AI-generated reports maintain the accuracy and completeness required for clinical use.