Paper Title
Performance Deterioration of Deep Learning Models after Clinical Deployment: A Case Study with Auto-segmentation for Definitive Prostate Cancer Radiotherapy
Paper Authors
Paper Abstract
We evaluated the temporal performance of a deep learning (DL) based artificial intelligence (AI) model for auto-segmentation in prostate radiotherapy, seeking to correlate its efficacy with changes in the clinical landscape. Our study involved 1328 prostate cancer patients who underwent definitive radiotherapy from January 2006 to August 2022 at the University of Texas Southwestern Medical Center. We trained a UNet-based segmentation model on data from 2006 to 2011 and tested it on data from 2012 to 2022 to simulate real-world clinical deployment. We measured model performance using the Dice similarity coefficient (DSC) and visualized trends in contour quality using exponentially weighted moving average (EMA) curves. Additionally, we performed Wilcoxon rank-sum tests to analyze differences in DSC distributions across distinct periods, and multiple linear regression to investigate the impact of various clinical factors. The model exhibited peak performance in the initial phase (from 2012 to 2014) for segmenting the prostate, rectum, and bladder. However, we observed a notable decline in performance for the prostate and rectum after 2015, while bladder contour quality remained stable. Key factors that impacted prostate contour quality included physician contouring styles, the use of various hydrogel spacers, CT scan slice thickness, MRI-guided contouring, and the use of intravenous (IV) contrast. Rectum contour quality was influenced by slice thickness, physician contouring styles, and the use of various hydrogel spacers. Bladder contour quality was primarily affected by the use of IV contrast. This study highlights the challenges of maintaining consistent AI model performance in a dynamic clinical setting. It underscores the need for continuous monitoring and updating of AI models to ensure their ongoing effectiveness and relevance in patient care.
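For readers unfamiliar with the two monitoring quantities named in the abstract, the following is a minimal illustrative sketch of a Dice similarity coefficient and an exponentially weighted moving average. It is not the authors' code: the function names, the flattened 0/1 mask representation, the smoothing factor `alpha`, and the convention that two empty masks score 1.0 are all assumptions made for illustration.

```python
from typing import Iterable, List, Sequence


def dice_coefficient(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    pred_count = sum(1 for p in pred if p)
    truth_count = sum(1 for t in truth if t)
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    if pred_count + truth_count == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    return 2.0 * overlap / (pred_count + truth_count)


def ema(values: Iterable[float], alpha: float = 0.1) -> List[float]:
    """Exponentially weighted moving average, seeded with the first value.

    alpha is a hypothetical smoothing factor; the abstract does not state
    the value used in the study.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed: List[float] = []
    for value in values:
        if not smoothed:
            smoothed.append(float(value))
        else:
            smoothed.append(alpha * value + (1.0 - alpha) * smoothed[-1])
    return smoothed
```

In a deployment-monitoring setting, one would compute `dice_coefficient` per patient in chronological order and pass the resulting sequence to `ema` to obtain a trend curve; a sustained downward drift in that curve is the kind of signal the abstract describes for the prostate and rectum after 2015.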