Paper Title

MetaFormer Baselines for Vision

Authors

Weihao Yu, Chenyang Si, Pan Zhou, Mi Luo, Yichen Zhou, Jiashi Feng, Shuicheng Yan, Xinchao Wang

Abstract

MetaFormer, the abstracted architecture of Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again without focusing on token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and summarize our observations as follows. (1) MetaFormer ensures a solid lower bound of performance. By merely adopting identity mapping as the token mixer, the MetaFormer model, termed IdentityFormer, achieves >80% accuracy on ImageNet-1K. (2) MetaFormer works well with arbitrary token mixers. Even when the token mixer is specified as a random matrix that mixes tokens, the resulting model, RandFormer, yields an accuracy of >81%, outperforming IdentityFormer. One can thus rest assured of MetaFormer's results when new token mixers are adopted. (3) MetaFormer effortlessly offers state-of-the-art results. With just conventional token mixers dating back five years, the models instantiated from MetaFormer already beat the state of the art. (a) ConvFormer outperforms ConvNeXt. Taking the common depthwise separable convolutions as the token mixer, the model termed ConvFormer, which can be regarded as a pure CNN, outperforms the strong CNN model ConvNeXt. (b) CAFormer sets a new record on ImageNet-1K. By simply applying depthwise separable convolutions as the token mixer in the bottom stages and vanilla self-attention in the top stages, the resulting model, CAFormer, sets a new record on ImageNet-1K: it achieves an accuracy of 85.5% at 224x224 resolution, under normal supervised training without external data or distillation. In our expedition to probe MetaFormer, we also find that a new activation, StarReLU, reduces activation FLOPs by 71% compared with GELU yet achieves better performance. We expect StarReLU to find great potential in MetaFormer-like models alongside other neural networks.
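The central idea the abstract describes is that the MetaFormer block keeps everything fixed except a pluggable token mixer. A minimal NumPy sketch follows; the single-block functional form, function names, and weight shapes here are illustrative assumptions, not the authors' implementation. The StarReLU form s·ReLU(x)²+b with scale ≈ 0.8944 and bias ≈ −0.4472 (chosen to roughly normalize the output for standard-normal input) follows the paper's description.

```python
import numpy as np

def star_relu(x, scale=0.8944, bias=-0.4472):
    # StarReLU as described in the abstract: s * ReLU(x)**2 + b.
    # Default constants approximately normalize output mean/variance
    # under a standard-normal input assumption.
    return scale * np.maximum(x, 0.0) ** 2 + bias

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def metaformer_block(x, token_mixer, w1, w2):
    # One MetaFormer block: a pluggable token mixer, then a channel MLP,
    # each wrapped in a residual connection around a normalized input.
    x = x + token_mixer(layer_norm(x))
    x = x + star_relu(layer_norm(x) @ w1) @ w2
    return x

rng = np.random.default_rng(0)
n_tokens, dim = 16, 32
x = rng.standard_normal((n_tokens, dim))
w1 = rng.standard_normal((dim, 4 * dim)) / np.sqrt(dim)
w2 = rng.standard_normal((4 * dim, dim)) / np.sqrt(4 * dim)

# (1) IdentityFormer-style mixer: identity mapping over tokens.
y_id = metaformer_block(x, lambda t: t, w1, w2)

# (2) RandFormer-style mixer: a fixed random matrix mixing tokens.
R = rng.standard_normal((n_tokens, n_tokens)) / np.sqrt(n_tokens)
y_rand = metaformer_block(x, lambda t: R @ t, w1, w2)

print(y_id.shape, y_rand.shape)  # both (16, 32)
```

Swapping `lambda t: t` for depthwise convolution or self-attention would yield the ConvFormer- and CAFormer-style variants; only the mixer changes, which is exactly the abstraction the paper probes.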
