A Simple Transformer-Based Model for Ego4D Natural Language Queries Challenge

Paper Title

A Simple Transformer-Based Model for Ego4D Natural Language Queries Challenge

Authors

Sicheng Mo, Fangzhou Mu, Yin Li

Abstract

This report describes Badgers@UW-Madison, our submission to the Ego4D Natural Language Queries (NLQ) Challenge. Our solution inherits the point-based event representation from our prior work on temporal action localization, and develops a Transformer-based model for video grounding. Further, our solution integrates several strong video features including SlowFast, Omnivore and EgoVLP. Without bells and whistles, our submission based on a single model achieves 12.64% Mean R@1 and is ranked 2nd on the public leaderboard. Meanwhile, our method garners 28.45% (18.03%) R@5 at tIoU=0.3 (0.5), surpassing the top-ranked solution by up to 5.5 absolute percentage points.
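The metrics quoted in the abstract can be illustrated with a minimal sketch. It is not the challenge's official evaluation code: the function names, the toy segments, and the reading of R@k as "at least one of the top-k predicted segments reaches the tIoU threshold" are assumptions made here for illustration. Mean R@1 is assumed to be the average of R@1 at tIoU=0.3 and tIoU=0.5.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(predictions, ground_truths, k, tiou_threshold):
    """Fraction of queries with at least one top-k segment at or above the threshold.

    predictions: one ranked list of (start, end) segments per query.
    ground_truths: one (start, end) segment per query.
    """
    if not ground_truths:
        return 0.0
    hits = 0
    for ranked, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) >= tiou_threshold for p in ranked[:k]):
            hits += 1
    return hits / len(ground_truths)


# Hypothetical toy data: two queries, two ranked predictions each.
predictions = [
    [(0.0, 4.0), (10.0, 20.0)],
    [(5.0, 6.0), (30.0, 40.0)],
]
ground_truths = [(11.0, 19.0), (50.0, 60.0)]

r1 = recall_at_k(predictions, ground_truths, k=1, tiou_threshold=0.3)
r5 = recall_at_k(predictions, ground_truths, k=5, tiou_threshold=0.3)

# Assumed definition of Mean R@1: average over the two tIoU thresholds.
mean_r1 = 0.5 * (
    recall_at_k(predictions, ground_truths, k=1, tiou_threshold=0.3)
    + recall_at_k(predictions, ground_truths, k=1, tiou_threshold=0.5)
)
```

In this toy case the top-ranked segment misses for both queries, while the second-ranked segment of the first query overlaps its ground truth with tIoU 0.8, so R@1 is 0 and R@5 is 0.5.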

Paper Title