Paper title
Sequencer: Deep LSTM for Image Classification
Paper authors
Paper abstract
In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention, a mechanism taken from natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have suggested that carefully redesigned convolutional neural networks (CNNs) can achieve performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in which inductive biases are suitable for computer vision. Here we propose Sequencer, a novel and competitive architecture that offers an alternative to ViT and a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers. We also propose a two-dimensional version of the Sequencer module, in which an LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, reaches 84.6% top-1 accuracy using only ImageNet-1K. We also show that it has good transferability and robust resolution adaptability across a doubled resolution band.
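The decomposition into vertical and horizontal LSTMs described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`init_lstm`, `lstm`, `sequencer2d_mix`), the weight initialisation, the use of unidirectional LSTMs, and the final linear projection `w_out` that merges the two directions are all assumptions made for illustration. The abstract only states that one LSTM is split into a vertical and a horizontal one.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm(rng, in_dim, hidden):
    """Random LSTM weights (illustrative initialisation, not from the paper)."""
    scale = 1.0 / np.sqrt(hidden)
    return {
        "W": rng.uniform(-scale, scale, (in_dim, 4 * hidden)),  # input weights
        "U": rng.uniform(-scale, scale, (hidden, 4 * hidden)),  # recurrent weights
        "b": np.zeros(4 * hidden),
    }


def lstm(x, p):
    """Run an LSTM over x of shape (batch, steps, in_dim).

    Returns the hidden state at every step, shape (batch, steps, hidden).
    """
    batch, steps, _ = x.shape
    hidden = p["U"].shape[0]
    h = np.zeros((batch, hidden))
    c = np.zeros((batch, hidden))
    out = np.empty((batch, steps, hidden))
    for t in range(steps):
        gates = x[:, t] @ p["W"] + h @ p["U"] + p["b"]
        i, f, g, o = np.split(gates, 4, axis=1)  # input, forget, cell, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[:, t] = h
    return out


def sequencer2d_mix(x, p_vert, p_horiz, w_out):
    """Mix spatial information in a feature map x of shape (H, W, C).

    Assumed design: rows and columns are processed by separate LSTMs and the
    two results are concatenated and projected back to C channels.
    """
    # Horizontal LSTM: each of the H rows is a sequence of W tokens.
    horiz = lstm(x, p_horiz)                             # (H, W, D)
    # Vertical LSTM: each of the W columns is a sequence of H tokens.
    vert = lstm(x.transpose(1, 0, 2), p_vert)            # (W, H, D)
    vert = vert.transpose(1, 0, 2)                       # (H, W, D)
    merged = np.concatenate([vert, horiz], axis=-1)      # (H, W, 2D)
    return merged @ w_out                                # (H, W, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, W, C, D = 4, 4, 8, 6  # toy sizes chosen for the example
    x = rng.standard_normal((H, W, C))
    p_vert = init_lstm(rng, C, D)
    p_horiz = init_lstm(rng, C, D)
    w_out = rng.standard_normal((2 * D, C)) / np.sqrt(2 * D)
    y = sequencer2d_mix(x, p_vert, p_horiz, w_out)
    print(y.shape)
```

In this sketch each output position sees only the earlier tokens in its own row and its own column, so one such module mixes information along the two axes separately rather than over all H×W tokens at once. Stacking several modules, or running the LSTMs in both directions, would widen that receptive field; both are left out here to keep the example short.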