Paper Title
An Empirical Study of End-to-End Temporal Action Detection
Paper Authors
Paper Abstract
Temporal action detection (TAD) is an important yet challenging task in video understanding. It aims to simultaneously predict the semantic label and the temporal interval of every action instance in an untrimmed video. Rather than learning end-to-end, most existing methods adopt a head-only learning paradigm: the video encoder is pre-trained for action classification, and only the detection head on top of the encoder is optimized for TAD. The effect of end-to-end learning has not been systematically evaluated. Moreover, an in-depth study of the efficiency-accuracy trade-off in end-to-end TAD is lacking. In this paper, we present an empirical study of end-to-end temporal action detection. We validate the advantage of end-to-end learning over head-only learning and observe up to 11\% performance improvement. We also study the effects of multiple design choices that affect TAD performance and speed, including the detection head, the video encoder, and the resolution of input videos. Based on these findings, we build a mid-resolution baseline detector that achieves state-of-the-art performance among end-to-end methods while running more than 4$\times$ faster. We hope that this paper can serve as a guide for end-to-end learning and inspire future research in this field. Code and models are available at \url{https://github.com/xlliu7/E2E-TAD}.