Paper Title

Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps

Paper Authors

Qi Zhu, Chenyu Gao, Peng Wang, Qi Wu

Abstract

Texts appearing in daily scenes that can be recognized by OCR (Optical Character Recognition) tools contain significant information, such as street names, product brands, and prices. Two tasks -- text-based visual question answering and text-based image captioning, which extend existing vision-language applications with text -- are catching on rapidly. To address these problems, many sophisticated multi-modality encoding frameworks (such as heterogeneous graph structures) have been used. In this paper, we argue that a simple attention mechanism can do the same or an even better job without any bells and whistles. Under this mechanism, we simply split OCR token features into separate visual- and linguistic-attention branches, and send them to a popular Transformer decoder to generate answers or captions. Surprisingly, we find this simple baseline model is rather strong -- it consistently outperforms state-of-the-art (SOTA) models on two popular benchmarks, TextVQA and all three tasks of ST-VQA, even though those SOTA models use far more complex encoding mechanisms. Transferring it to text-based image captioning, we also surpass the TextCaps Challenge 2020 winner. We hope this work sets a new baseline for these two OCR-text-related applications and inspires new thinking about multi-modality encoder design. Code is available at https://github.com/ZephyrZhuQi/ssbaseline
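The abstract's core architectural idea can be illustrated with a minimal PyTorch sketch: OCR token features are split into separate visual and linguistic streams, the question attends over each stream independently, and the fused context is fed to a standard Transformer decoder. This is not the released implementation -- all module choices, dimensions, the concatenation-based fusion, and the plain vocabulary head below are illustrative assumptions, and a real TextVQA model typically also needs a pointer/copy mechanism for OCR tokens, which is omitted here.

```python
# Minimal sketch of the "split OCR attention + Transformer decoder" idea.
# All hyperparameters and module choices are illustrative, not the paper's.
import torch
import torch.nn as nn


class SplitOCRAttentionBaseline(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_dec_layers=4, vocab_size=30522):
        super().__init__()
        # Question tokens attend over OCR *visual* features (appearance, layout).
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # ...and, in a separate branch, over OCR *linguistic* features
        # (e.g., FastText/PHOC-style embeddings of the recognized strings).
        self.linguistic_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        decoder_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_dec_layers)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, question, ocr_visual, ocr_linguistic, answer_embeds):
        # question: (B, Lq, D); ocr_visual / ocr_linguistic: (B, N_ocr, D)
        # answer_embeds: (B, La, D), embedded answer tokens for teacher forcing
        vis_ctx, _ = self.visual_attn(question, ocr_visual, ocr_visual)
        lin_ctx, _ = self.linguistic_attn(question, ocr_linguistic, ocr_linguistic)
        # Simple fusion by concatenating the two attended branches along length.
        memory = torch.cat([vis_ctx, lin_ctx], dim=1)
        # A causal tgt_mask would be added for real training; omitted for brevity.
        hidden = self.decoder(tgt=answer_embeds, memory=memory)
        return self.out_proj(hidden)  # (B, La, vocab_size) token logits


# Toy usage with random tensors standing in for real question/OCR encoders.
model = SplitOCRAttentionBaseline()
q = torch.randn(2, 20, 768)      # question token features
ocr_v = torch.randn(2, 50, 768)  # OCR visual features
ocr_l = torch.randn(2, 50, 768)  # OCR linguistic features
ans = torch.randn(2, 12, 768)    # embedded answer tokens
logits = model(q, ocr_v, ocr_l, ans)
print(logits.shape)  # torch.Size([2, 12, 30522])
```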
