Paper Title

WaveMix: Resource-efficient Token Mixing for Images

Paper Authors

Pranav Jeevan, Amit Sethi

Paper Abstract

Although certain vision transformer (ViT) and CNN architectures generalize well on vision tasks, it is often impractical to use them on green, edge, or desktop computing due to their computational requirements for training and even testing. We present WaveMix as an alternative neural architecture that uses a multi-scale 2D discrete wavelet transform (DWT) for spatial token mixing. Unlike ViTs, WaveMix neither unrolls the image nor requires self-attention of quadratic complexity. Additionally, the DWT introduces another inductive bias -- besides convolutional filtering -- that exploits the 2D structure of an image to improve generalization. The multi-scale nature of the DWT also reduces the need for a deeper architecture compared to CNNs, as the latter rely on pooling for partial spatial mixing. WaveMix models show generalization that is competitive with ViTs, CNNs, and token mixers on several datasets while requiring less GPU RAM (for both training and testing), fewer computations, and less storage. WaveMix has achieved state-of-the-art (SOTA) results on the EMNIST ByClass and EMNIST Balanced datasets.
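
The abstract describes spatial token mixing with a multi-scale 2D DWT in place of self-attention. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: it uses only a single-level Haar DWT (the paper uses a multi-scale transform), and the names (`haar_dwt2`, `WaveMixStyleBlock`), the channel-reduction-before-DWT layout, the 1x1-conv MLP, and the transposed-convolution upsampling are illustrative assumptions.

```python
import torch
import torch.nn as nn


def haar_dwt2(x):
    """Level-1 2D Haar DWT: returns the four subbands stacked along the
    channel axis. Input (B, C, H, W) with even H and W; output (B, 4C, H/2, W/2)."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2   # low-frequency approximation subband
    lh = (a - b + c - d) / 2   # high-frequency detail subbands
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return torch.cat([ll, lh, hl, hh], dim=1)


class WaveMixStyleBlock(nn.Module):
    """Hypothetical token-mixing block: reduce channels, apply a DWT so each
    output location mixes a 2x2 spatial neighbourhood across four subbands,
    mix channels with a 1x1-conv MLP, then upsample back and add a residual."""

    def __init__(self, dim, mult=2):
        super().__init__()
        assert dim % 4 == 0, "dim must be divisible by 4 for the subband concat"
        self.reduce = nn.Conv2d(dim, dim // 4, kernel_size=1)
        self.mix = nn.Sequential(
            nn.Conv2d(dim, dim * mult, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim * mult, dim, kernel_size=1),
        )
        self.up = nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2)
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, x):
        y = self.reduce(x)    # (B, dim/4, H, W)
        y = haar_dwt2(y)      # (B, dim, H/2, W/2): spatial mixing via the DWT
        y = self.mix(y)       # channel mixing of wavelet coefficients
        y = self.up(y)        # back to (B, dim, H, W)
        return self.norm(y + x)


# Example: a 64-channel feature map of a 32x32 image keeps its shape.
block = WaveMixStyleBlock(dim=64)
out = block(torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Because the DWT halves the spatial resolution while the fixed wavelet filters cost no learned parameters, each block mixes information over a wider receptive field than a same-sized convolution, which is one way to read the abstract's claim about needing a shallower architecture than CNNs.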
