Paper Title

Cross-modal Contrastive Distillation for Instructional Activity Anticipation

Authors

Zhengyuan Yang, Jingen Liu, Jing Huang, Xiaodong He, Tao Mei, Chenliang Xu, Jiebo Luo

Abstract

In this study, we address the task of instructional activity anticipation: predicting plausible future action steps given an observation of the past. Unlike previous anticipation tasks, which aim at action label prediction, our work aims to generate natural language outputs that provide interpretable and accurate descriptions of future action steps. The task is challenging because little semantic information can be extracted from the instructional videos. To overcome this challenge, we propose a novel knowledge distillation framework that exploits related external textual knowledge to assist the visual anticipation task. However, previous knowledge distillation techniques generally transfer information within the same modality. To bridge the gap between the visual and text modalities during distillation, we devise a novel cross-modal contrastive distillation (CCD) scheme, which facilitates knowledge distillation between a teacher and a student in heterogeneous modalities using the proposed cross-modal distillation loss. We evaluate our method on the Tasty Videos dataset. CCD improves the anticipation performance of the visual-alone student model by a relative 40.2% in BLEU4. Our approach also outperforms state-of-the-art approaches by a large margin.
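The abstract does not state the form of the cross-modal distillation loss. As an illustration only, the sketch below assumes a generic InfoNCE-style contrastive objective, a common way to align embeddings from two modalities: the visual student embedding of sample i is pulled toward the text teacher embedding of the same sample and pushed away from the teacher embeddings of the other samples in the batch. The function names, the `temperature` parameter, and the use of cosine similarity are all assumptions and may differ from the loss used in the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def contrastive_distillation_loss(student_visual, teacher_text, temperature=0.1):
    """Hypothetical InfoNCE-style loss between two batches of embeddings.

    Row i of student_visual and row i of teacher_text form the positive pair;
    every other teacher row in the batch serves as a negative.
    """
    s = [l2_normalize(v) for v in student_visual]
    t = [l2_normalize(v) for v in teacher_text]
    total = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity to every teacher embedding, scaled by temperature.
        logits = [dot(s_i, t_j) / temperature for t_j in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of selecting the matching teacher row.
        total += log_denom - logits[i]
    return total / len(s)


# Toy batch: the student matches the teacher in the first call, not in the second.
aligned = contrastive_distillation_loss([[1.0, 0.0], [0.0, 1.0]],
                                        [[1.0, 0.0], [0.0, 1.0]])
misaligned = contrastive_distillation_loss([[1.0, 0.0], [0.0, 1.0]],
                                           [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is lower when each student embedding points in the same direction as its paired teacher embedding, which is the behavior a contrastive distillation objective is meant to reward.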
