Paper title
Masked Vision-Language Transformers for Scene Text Recognition
Paper authors
Paper abstract
Scene text recognition (STR) enables computers to recognize and read text in a variety of real-world scenes. Recent STR models benefit from considering linguistic information in addition to visual cues. We propose a novel Masked Vision-Language Transformer (MVLT) that captures both explicit and implicit linguistic information. Our encoder is a Vision Transformer, and our decoder is a multi-modal Transformer. MVLT is trained in two stages: in the first stage, we design an STR-tailored pretraining method based on a masking strategy; in the second stage, we fine-tune the model and adopt an iterative correction method to improve performance. MVLT attains superior results compared to state-of-the-art STR models on several benchmarks. Our code and model are available at https://github.com/onealwj/MVLT.
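The abstract mentions a masking-based pretraining stage but does not say what is masked or at what ratio. As a rough illustration only, the sketch below shows generic random patch masking of the kind commonly used in masked-image pretraining. It is not the MVLT procedure: the function name mask_patches, the 16-patch input, and the 0.75 ratio are all assumptions chosen for the example.

```python
import random


def mask_patches(num_patches, mask_ratio, seed=None):
    """Split patch indices into a hidden set and a visible set.

    Hypothetical helper for illustration; not taken from the MVLT code.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    # Number of patches to hide, rounded down.
    num_masked = int(num_patches * mask_ratio)
    # Sample distinct patch indices without replacement.
    masked = sorted(rng.sample(range(num_patches), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_patches) if i not in masked_set]
    return masked, visible


# Assumed values: 16 patches, 75% of them hidden.
masked, visible = mask_patches(num_patches=16, mask_ratio=0.75, seed=0)
print(len(masked), len(visible))  # 12 4
```

In a masked-pretraining setup of this general kind, the encoder would see only the visible patches and the model would be trained to recover what was hidden. Which signals MVLT actually reconstructs is described in the paper and repository, not here.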