Paper Title
Speech2Action: Cross-modal Supervision for Action Recognition
Paper Authors
Paper Abstract
Is it possible to guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie screenplays both describe actions and contain the speech of characters, and hence can be used to learn this correlation with no additional supervision. We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments. We then apply this model to the speech segments of a large unlabelled movie corpus (188M speech segments from 288K movies). Using the predictions of this model, we obtain weak action labels for over 800K video clips. By training on these video clips, we demonstrate superior performance on standard action recognition benchmarks, without using a single manually labelled action example.
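To make the text side of the pipeline concrete, here is a minimal sketch (not the authors' released code) of the two stages the abstract describes: fine-tuning a BERT sequence classifier on speech/action pairs, then applying it to an unlabelled speech corpus to mine weak labels. The `bert-base-uncased` checkpoint, the verb list, the confidence threshold, and the helper functions are illustrative assumptions, not specifics given by the paper; only the Hugging Face `transformers` API calls shown are standard.

```python
# Illustrative sketch of Speech2Action's text stage, under the
# assumptions stated above. Requires `torch` and `transformers`.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Example verb classes; stand-ins for the action labels mined from
# screenplay stage directions.
ACTION_LABELS = ["drive", "phone", "run", "kiss", "eat", "shoot"]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The classification head below is randomly initialised; in the
# paper's setting it would first be fine-tuned on (speech segment,
# action label) pairs extracted from movie screenplays.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(ACTION_LABELS)
)
model.eval()

def predict_action(speech: str) -> tuple[str, float]:
    """Return the predicted action label and its confidence for one
    transcribed speech segment."""
    inputs = tokenizer(speech, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
    idx = int(probs.argmax())
    return ACTION_LABELS[idx], float(probs[idx])

def mine_weak_labels(segments, threshold=0.9):
    """Keep only high-confidence predictions as weak labels for the
    corresponding video clips (the threshold is a placeholder)."""
    for seg in segments:
        label, conf = predict_action(seg["text"])
        if conf >= threshold:
            yield seg["clip_id"], label

# Usage: run over transcribed speech segments from an unlabelled corpus.
corpus = [{"clip_id": "clip_000",
           "text": "Keep your hands on the wheel and watch the road!"}]
for clip_id, label in mine_weak_labels(corpus, threshold=0.5):
    print(clip_id, label)
```

In the paper's setting, the video clips paired with these mined labels then serve as weak supervision for training a standard action recognition model, with no manually labelled action examples.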