Masked Vision-Language Transformers for Scene Text Recognition

Paper Title

Masked Vision-Language Transformers for Scene Text Recognition

Authors

Wu, Jie; Peng, Ying; Zhang, Shengming; Qi, Weigang; Zhang, Jian

Abstract

Scene text recognition (STR) enables computers to recognize and read text in various real-world scenes. Recent STR models benefit from taking linguistic information into consideration in addition to visual cues. We propose a novel Masked Vision-Language Transformer (MVLT) to capture both explicit and implicit linguistic information. Our encoder is a Vision Transformer, and our decoder is a multi-modal Transformer. MVLT is trained in two stages: in the first stage, we design an STR-tailored pretraining method based on a masking strategy; in the second stage, we fine-tune our model and adopt an iterative correction method to improve performance. MVLT attains superior results compared to state-of-the-art STR models on several benchmarks. Our code and model are available at https://github.com/onealwj/MVLT.
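
The abstract mentions a masking-based pretraining stage but does not describe how patches are selected. As a rough illustration only, the sketch below shows generic random patch masking of the kind commonly used in masked-image pretraining. It is not taken from the MVLT code; the function name `mask_patches` and the parameters `num_patches`, `mask_ratio`, and `seed` are hypothetical.

```python
import random


def mask_patches(num_patches, mask_ratio, seed=None):
    """Randomly split patch indices into visible and masked sets.

    Generic illustration of random patch masking; NOT the authors' method.
    All names and the sampling scheme here are assumptions.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)          # local RNG so results are reproducible
    indices = list(range(num_patches))
    rng.shuffle(indices)               # random permutation of patch indices
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


if __name__ == "__main__":
    # Example: an image split into 16 patches, with 75% of them hidden.
    visible, masked = mask_patches(16, 0.75, seed=0)
    print("visible:", visible)
    print("masked:", masked)
```

In a setup like this, an encoder would see only the visible patches and the model would be trained to recover what was hidden. How MVLT actually combines visual and textual masking is described in the paper and repository, not here.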
