Paper Title
Self-supervised pre-training and contrastive representation learning for multiple-choice video QA
Paper Authors
Paper Abstract
Video Question Answering (Video QA) requires a fine-grained understanding of both the video and language modalities to answer the given questions. In this paper, we propose novel training schemes for multiple-choice video question answering, with a self-supervised pre-training stage and supervised contrastive learning in the main stage as an auxiliary task. In the self-supervised pre-training stage, we transform the original problem format of predicting the correct answer into one of predicting the relevant question, providing the model with broader contextual inputs without any additional dataset or annotation. For contrastive learning in the main stage, we add masking noise to the input corresponding to the ground-truth answer and treat the original input of the ground-truth answer as a positive sample, while treating the rest as negative samples. By mapping the positive sample closer to the masked input, we show that model performance is improved. We further employ locally aligned attention to focus more effectively on the video frames that are particularly relevant to the given corresponding subtitle sentences. We evaluate our proposed model on highly competitive benchmark datasets for multiple-choice video QA: TVQA, TVQA+, and DramaQA. Experimental results show that our model achieves state-of-the-art performance on all datasets. We also validate our approaches through further analyses.
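The abstract describes a contrastive auxiliary objective in which the masked ground-truth input serves as the anchor, the original (unmasked) ground-truth input is the positive, and the remaining answer candidates are negatives. The sketch below is a minimal, hypothetical illustration of such an objective (not the authors' implementation), assuming an InfoNCE-style formulation with cosine similarity and a temperature hyperparameter.

```python
# Minimal sketch (assumption, not the paper's code) of a contrastive auxiliary
# loss: pull the original ground-truth input embedding toward the embedding of
# its masked counterpart, and push the wrong-answer embeddings away.
import torch
import torch.nn.functional as F

def contrastive_auxiliary_loss(anchor, positive, negatives, temperature=0.1):
    """anchor: (d,) embedding of the masked ground-truth input.
    positive: (d,) embedding of the original ground-truth input.
    negatives: (k, d) embeddings of the wrong-answer inputs."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = torch.dot(anchor, positive) / temperature   # scalar similarity
    neg_sim = negatives @ anchor / temperature             # (k,) similarities

    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])    # positive at index 0
    # Cross-entropy with target index 0 maps the positive closer to the masked
    # anchor while pushing the negative samples away.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```

In training, this loss would be added to the standard answer-classification objective as an auxiliary term; the exact weighting and masking scheme are specified in the paper, not here.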