Research on the application of large language models in the diagnosis and treatment decision support for primary diseases related to pediatric liver transplantation

王元浩; 钟成鹏; 吴宇轩; 何康; 夏强

doi:10.12464/j.issn.1674-7445.2026035

Abstract:
Objective To explore the application value of three mainstream large language models in the diagnosis, differential diagnosis, and treatment decision support of primary diseases related to pediatric liver transplantation.
Methods Seventy-nine pediatric liver transplantation-related cases were collected, drawn from patients treated at Renji Hospital, Shanghai Jiao Tong University School of Medicine, or from published high-quality case reports. All diagnoses were confirmed by pathology or clinical follow-up data, and the cases covered 25 primary diseases, including cholestatic liver diseases, metabolic diseases, and tumors. Using standardized prompts, the case information was entered separately into the DeepSeek-R1, ChatGPT-4o, and Grok-3 models. Each model was evaluated on the accuracy of its preliminary diagnosis and differential diagnosis based on basic clinical data, the accuracy of its final diagnosis after supplementary examination results were added, its response time, and the completeness and rationality of its analysis of treatment principles.
Results In the preliminary diagnosis and differential diagnosis stage, DeepSeek-R1 had the highest comprehensive accuracy, 72.1% (95% confidence interval [CI] 61.4%-80.8%), and the difference in comprehensive preliminary diagnostic accuracy among the three models was statistically significant (P = 0.008). After further examination information was added, the final diagnosis accuracy of all three models increased, to 88.6% (95% CI 79.7%-93.9%) for DeepSeek-R1, 87.3% (95% CI 78.2%-93.0%) for ChatGPT-4o, and 78.5% (95% CI 68.2%-86.1%) for Grok-3; the difference among the three models was not statistically significant (P = 0.05). The experts' scores for the diagnosis and treatment principles showed good agreement (Kappa = 0.769). In addition, ChatGPT-4o had a shorter response time, (24 ± 7) s, than the other two models.
Conclusions Large language models performed well across the diagnostic and treatment decision-making process for a range of pediatric liver diseases. They show promise for auxiliary diagnosis and decision support, and may help improve the accuracy and efficiency of clinical diagnosis and treatment of primary diseases related to pediatric liver transplantation.
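
The abstract does not say how the 95% CIs in the Results were computed. As a minimal sketch, the code below assumes a Wilson score interval for a binomial proportion over the 79 cases. The function name `wilson_interval` and the correct-case counts (70, 69, and 62 of 79) are assumptions back-calculated from the reported final diagnosis percentages; they are not figures stated in the source.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.959964 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


N_CASES = 79  # number of cases stated in the Methods

# ASSUMPTION: counts inferred from the reported final diagnosis accuracies
# (88.6%, 87.3%, 78.5% of 79); the source does not list raw counts.
inferred_correct = {"DeepSeek-R1": 70, "ChatGPT-4o": 69, "Grok-3": 62}

for model, correct in inferred_correct.items():
    low, high = wilson_interval(correct, N_CASES)
    print(f"{model}: {correct / N_CASES:.1%} (95% CI {low:.1%}-{high:.1%})")
```

Under these assumptions the computed intervals agree with the reported final diagnosis CIs to one decimal place. This suggests, but does not confirm, that a Wilson-type interval was used.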
