Paper Title
Visual Transformer for Object Detection
Paper Authors
Paper Abstract
Convolutional Neural Networks (CNNs) have been the paradigm of choice in many computer vision applications. The convolution operation, however, has a significant weakness: it operates only on a local neighborhood of pixels and therefore misses global information from the wider context. Transformers, or self-attention networks more specifically, have emerged as a recent advance for capturing long-range interactions in the input, but they have mostly been applied to sequence modeling tasks such as neural machine translation, image captioning, and other natural language processing tasks, where they have achieved promising results. Their application to vision tasks, however, remains far from satisfactory. Taking into account the weaknesses of both Convolutional Neural Networks and Transformers, in this paper we consider the use of self-attention for a discriminative visual task, object detection, as an alternative to convolutions, and propose our model, DetTransNet. Extensive experiments show that our model leads to consistent improvements in object detection on COCO across many different backbones and scales, including ResNets, while keeping the number of parameters similar. In particular, our method achieves a 1.2% Average Precision improvement on the COCO object detection task over baseline models.
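To make the convolution-versus-attention contrast above concrete, the following is a minimal PyTorch sketch of a 2D self-attention block that could stand in for a convolutional layer on a feature map: every spatial position attends to every other position, capturing the long-range interactions a local kernel misses. The name SelfAttention2d and the channel/head counts are illustrative assumptions; this is a generic sketch, not the actual DetTransNet architecture, whose details the abstract does not specify.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Illustrative global self-attention over a 2D feature map.

    Unlike a convolution, which mixes information only within a local
    kernel window, every position here attends to all h*w positions.
    Hypothetical sketch; not the paper's DetTransNet implementation.
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # channels must be divisible by num_heads
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width), as a conv layer would receive
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)       # (batch, h*w, channels)
        attn_out, _ = self.attn(seq, seq, seq)   # global attention across positions
        seq = self.norm(seq + attn_out)          # residual connection + layer norm
        return seq.transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    layer = SelfAttention2d(channels=64)
    feats = torch.randn(2, 64, 16, 16)           # e.g. a backbone stage output
    print(layer(feats).shape)                    # torch.Size([2, 64, 16, 16])
```

Because the block preserves the (batch, channels, height, width) shape, it can be swapped in for a convolutional layer in a detection backbone without changing the surrounding architecture, which is one plausible reading of how self-attention serves "as an alternative to convolutions" here.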