Paper Title
M&M Mix: A Multimodal Multiview Transformer Ensemble
Paper Authors
Paper Abstract
This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission is an ensemble of Multimodal MTV (M&M) models with varying backbone sizes and input modalities. It achieved 52.8% Top-1 accuracy for action classes on the test set, which is 4.1% higher than last year's winning entry.