Paper Title
Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection
Paper Authors
Abstract
Recently, transformer-based methods have achieved promising progress in object detection, as they can eliminate post-processing steps such as NMS and enrich deep representations. However, these methods cannot cope well with scene text due to its extreme variance in scale and aspect ratio. In this paper, we present a simple yet effective transformer-based architecture for scene text detection. Unlike previous approaches that learn robust deep representations of scene text in a holistic manner, our method performs scene text detection based on a few representative features, which avoids interference from the background and reduces the computational cost. Specifically, we first select a few representative features at all scales that are highly relevant to foreground text. Then, we adopt a transformer to model the relationships among the sampled features, effectively dividing them into reasonable groups. As each feature group corresponds to a text instance, its bounding box can be easily obtained without any post-processing operation. Using a basic feature pyramid network for feature extraction, our method consistently achieves state-of-the-art results on several popular datasets for scene text detection.
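The two-stage pipeline the abstract describes — sample a few high-scoring features across all pyramid scales, then group the sampled features into text instances — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the scoring function, and the cosine-similarity grouping (a simple stand-in for the paper's transformer-based relationship modeling) are all assumptions.

```python
import numpy as np

def sample_topk_features(feature_maps, score_fn, k):
    """Flatten multi-scale feature maps, score every spatial location
    for foreground-text relevance, and keep only the top-k features."""
    flat = [fm.reshape(-1, fm.shape[-1]) for fm in feature_maps]  # each (H*W, C)
    feats = np.concatenate(flat, axis=0)                          # (N, C)
    scores = score_fn(feats)                                      # (N,)
    order = np.argsort(scores)[::-1][:k]                          # top-k indices
    return feats[order], scores[order]

def group_by_similarity(feats, threshold=0.8):
    """Group sampled features whose pairwise cosine similarity exceeds a
    threshold (union-find). In the paper this role is played by a
    transformer that models relationships among the sampled features."""
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    n = len(feats)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())  # each group would yield one text box
```

Each resulting group corresponds to one candidate text instance, so a bounding box can be regressed per group directly, with no NMS-style post-processing over dense predictions.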