{"doi":"10.1101/2024.04.03.24305088","title":"Evaluating Large Language Models for Drafting Emergency Department Discharge Summaries","abstract":"<h4>Importance</h4> Large language models (LLMs) possess a range of capabilities which may be applied to the clinical domain, including text summarization. As ambient artificial intelligence scribes and other LLM-based tools begin to be deployed within healthcare settings, rigorous evaluations of the accuracy of these technologies are urgently needed. <h4>Objective</h4> To investigate the performance of GPT-4 and GPT-3.5-turbo in generating Emergency Department (ED) discharge summaries and evaluate the prevalence and type of errors across each section of the discharge summary. <h4>Design</h4> Cross-sectional study. <h4>Setting</h4> University of California, San Francisco ED. <h4>Participants</h4> We identified all adult ED visits from 2012 to 2023 with an ED clinician note. We randomly selected a sample of 100 ED visits for GPT-summarization. <h4>Exposure</h4> We investigate the potential of two state-of-the-art LLMs, GPT-4 and GPT-3.5-turbo, to summarize the full ED clinician note into a discharge summary. <h4>Main Outcomes and Measures</h4> GPT-3.5-turbo and GPT-4-generated discharge summaries were evaluated by two independent Emergency Medicine physician reviewers across three evaluation criteria: 1) Inaccuracy of GPT-summarized information; 2) Hallucination of information; 3) Omission of relevant clinical information. On identifying each error, reviewers were additionally asked to provide a brief explanation for their reasoning, which was manually classified into subgroups of errors. <h4>Results</h4> From 202,059 eligible ED visits, we randomly sampled 100 for GPT-generated summarization and then expert-driven evaluation. In total, 33% of summaries generated by GPT-4 and 10% of those generated by GPT-3.5-turbo were entirely error-free across all evaluated domains. 
Summaries generated by GPT-4 were mostly accurate, with inaccuracies found in only 10% of cases; however, 42% of the summaries exhibited hallucinations and 47% omitted clinically relevant information. Inaccuracies and hallucinations were most commonly found in the Plan sections of GPT-generated summaries, while clinical omissions were concentrated in text describing patients’ Physical Examination findings or History of Presenting Complaint. <h4>Conclusions and Relevance</h4> In this cross-sectional study of 100 ED encounters, we found that LLMs could generate accurate discharge summaries, but were liable to hallucination and omission of clinically relevant information. A comprehensive understanding of the location and type of errors found in GPT-generated clinical text is important to facilitate clinician review of such content and prevent patient harm.","journal":null,"year":2024,"id":367,"datarank":0.4943755299006494,"base_score":3.295836866004329,"endowment":3.295836866004329,"self_citation_contribution":0.4943755299006494,"citation_network_contribution":0.0,"self_endowment_contribution":0.4943755299006494,"citer_contribution":0.0,"corpus_percentile":null,"corpus_rank":null,"citation_count":26,"citer_count":0,"citers_with_citation_signal":0,"citers_with_endowment":0,"datacite_reuse_total":0,"is_dataset":false,"is_dataset_confidence":0.043,"is_oa":true,"file_count":0,"downloads":0,"has_version_chain":false,"published_date":"2024-04-04","fair_score":null,"fair_percentile":null,"algorithm_id":"datarank_citation_only_1hop_v6","ranking_scope":"data_only","authors":[{"id":3413,"name":"Jaskaran Bains","orcid":"0000-0001-7880-5878","position":1,"is_corresponding":false},{"id":3414,"name":"Tianyu Tang","orcid":null,"position":2,"is_corresponding":false},{"id":3415,"name":"Kishan Patel","orcid":null,"position":3,"is_corresponding":false},{"id":3416,"name":"Alexa N. 
Lucas","orcid":null,"position":4,"is_corresponding":false},{"id":3417,"name":"Fiona Chen","orcid":null,"position":5,"is_corresponding":false},{"id":3067,"name":"Brenda Y. Miao","orcid":"0000-0002-3393-9837","position":6,"is_corresponding":false},{"id":51,"name":"Atul Janardhan Butte","orcid":"0000-0002-7433-2740","position":7,"is_corresponding":false},{"id":3418,"name":"Aaron E. Kornblith","orcid":"0000-0002-1344-575X","position":8,"is_corresponding":false},{"id":3419,"name":"Alexa Lucas","orcid":null,"position":9,"is_corresponding":false},{"id":3412,"name":"Christopher Y. K. Williams","orcid":"0000-0001-8867-1623","position":0,"is_corresponding":true}],"reference_count":28,"raw_metadata":null,"created_at":"2026-03-01T18:20:47.508186Z","pmid":null,"pmcid":null,"fwci":null,"citation_percentile":null,"influential_citations":0,"oa_status":null,"license":null,"views":0,"total_file_size_bytes":0,"version_count":0,"fair_f":null,"fair_a":null,"fair_i":null,"fair_r":null,"fair_zscore":null,"fair_rationale":null,"fair_model":null,"fair_agent_version":null,"fair_fulltext_source":null,"fair_has_llm":null,"fair_computed_at":null,"clinical_trials":[],"software_tools":[],"db_accessions":[],"linked_datasets":[],"topics":[]}