Surgical Skill Assessment via Video Semantic Aggregation

Paper Title

Surgical Skill Assessment via Video Semantic Aggregation

Authors

Zhenqiang Li, Lin Gu, Weimin Wang, Ryosuke Nakamura, Yoichi Sato

Abstract

Automated video-based assessment of surgical skills is a promising way to assist young surgical trainees, especially in resource-poor areas. Existing works often resort to a CNN-LSTM joint framework in which LSTMs model long-term relationships over spatially pooled short-term CNN features. However, this practice inevitably neglects the differences among semantic concepts such as tools, tissues, and background in the spatial dimension, impeding the subsequent temporal relationship modeling. In this paper, we propose a novel skill assessment framework, Video Semantic Aggregation (ViSA), which discovers different semantic parts and aggregates them across spatiotemporal dimensions. The explicit discovery of semantic parts provides an explanatory visualization that helps understand the neural network's decisions. It also enables us to further incorporate auxiliary information such as kinematic data to improve representation learning and performance. Experiments on two datasets show the competitiveness of ViSA compared to state-of-the-art methods. Source code is available at: bit.ly/MICCAI2022ViSA.
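
The contrast the abstract draws between spatial pooling and semantic-part aggregation can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the fixed part centers, the softmax-over-distance assignment, and the temperature value are all assumptions chosen for illustration, and the temporal (LSTM) stage is omitted. It only shows why averaging over all spatial positions blends distinct concepts, while grouping positions by similarity keeps one feature per part.

```python
import math

def global_average_pool(features):
    """Average all spatial positions into one vector (the pooled baseline)."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]

def soft_assign(features, centers, temperature=1.0):
    """Softmax over negative squared distances: one weight per (position, part)."""
    weights = []
    for f in features:
        logits = [-sum((a - b) ** 2 for a, b in zip(f, c)) / temperature
                  for c in centers]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def aggregate_parts(features, centers, temperature=1.0):
    """Return one assignment-weighted mean feature per (hypothetical) part."""
    weights = soft_assign(features, centers, temperature)
    dim = len(features[0])
    parts = []
    for k in range(len(centers)):
        mass = sum(w[k] for w in weights)  # always > 0 because softmax is positive
        parts.append([sum(w[k] * f[d] for w, f in zip(weights, features)) / mass
                      for d in range(dim)])
    return parts

# Toy frame: four spatial positions with 2-D features.
# The first two resemble a "tool", the last two resemble "background".
frame = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centers = [[1.0, 0.0], [0.0, 1.0]]  # assumed part prototypes, not learned here

pooled = global_average_pool(frame)                       # one blended vector
parts = aggregate_parts(frame, centers, temperature=0.1)  # one vector per part
```

In this toy frame the pooled vector sits midway between the two concepts, whereas each part feature stays close to its own group. In a full model the part centers would be learned and the per-part features would then be passed to a temporal model, which is outside the scope of this sketch.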
