论文标题
可伸缩率:重新思考视觉变压器的上下文概括
ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer
论文作者
论文摘要
香草自我注意的机制固有地依赖于预定且坚定的计算维度。这种僵化的性限制了它具有面向上下文的概括,可以带来更多的上下文提示和全球表示。为了减轻此问题,我们提出了一种可扩展的自我注意(SSA)机制,该机制利用两个缩放因素释放查询,键和价值矩阵的尺寸,同时将其与输入不符合输入。这种可伸缩性可提供面向上下文的概括并增强对象灵敏度,从而将整个网络推向准确性和成本之间的更有效的权衡状态。此外,我们提出了一个基于窗口的自我发挥(IWSA),该自我发项(IWSA)通过重新合并独立的值代币并从相邻窗口中汇总空间信息来建立非重叠区域之间的相互作用。通过交替堆叠SSA和IWSA,可扩展的视觉变压器(可伸缩率)在通用视觉任务中实现了最先进的性能。例如,在Imagenet-1K分类中,可伸缩率S的表现优于双胞胎-SVT-S,而SWIN-T则比1.8%。
The vanilla self-attention mechanism inherently relies on pre-defined and steadfast computational dimensions. Such inflexibility restricts it from possessing context-oriented generalization that can bring more contextual cues and global representations. To mitigate this issue, we propose a Scalable Self-Attention (SSA) mechanism that leverages two scaling factors to release dimensions of query, key, and value matrices while unbinding them with the input. This scalability fetches context-oriented generalization and enhances object sensitivity, which pushes the whole network into a more effective trade-off state between accuracy and cost. Furthermore, we propose an Interactive Window-based Self-Attention (IWSA), which establishes interaction between non-overlapping regions by re-merging independent value tokens and aggregating spatial information from adjacent windows. By stacking the SSA and IWSA alternately, the Scalable Vision Transformer (ScalableViT) achieves state-of-the-art performance in general-purpose vision tasks. For example, ScalableViT-S outperforms Twins-SVT-S by 1.4% and Swin-T by 1.8% on ImageNet-1K classification.