Paper Title
Transformer Vs. MLP-Mixer: Exponential Expressive Gap For NLP Problems
Paper Authors
Paper Abstract
Vision Transformers are widely used in a variety of vision tasks. Meanwhile, another line of work, starting with the MLP-Mixer, attempts to achieve similar performance using MLP-based architectures. Interestingly, these MLP-based architectures have not yet been adapted for NLP tasks, and so far they have also failed to achieve state-of-the-art performance in vision tasks. In this paper, we analyze the expressive power of MLP-based architectures in modeling dependencies between multiple different inputs simultaneously, and show an exponential gap between attention-based and MLP-based mechanisms. Our results suggest a theoretical explanation for the inability of MLPs to compete with attention-based mechanisms on NLP problems. They also suggest that the performance gap in vision tasks may stem from the relative weakness of MLPs in modeling dependencies between multiple different locations, and that combining smart input permutations with MLP architectures may not be enough to close the performance gap on its own.
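To make the architectural distinction behind the abstract concrete, below is a minimal sketch contrasting the two mixing mechanisms: self-attention computes its mixing weights from the input itself, whereas MLP-Mixer-style token mixing applies a fixed learned matrix across positions. This is an informal illustration under assumed simplifications (single head, no learned projections, no channel MLP, made-up shapes), not the construction analyzed in the paper.

```python
# Illustrative sketch (not from the paper): how self-attention vs. an
# MLP-Mixer-style token-mixing layer combine information across positions.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_mixing(X):
    # Single-head self-attention without learned projections (assumption for brevity).
    # The (seq, seq) mixing weights are computed from the input, so which positions
    # interact depends on the content at each position.
    scores = X @ X.T / np.sqrt(X.shape[-1])
    return softmax(scores, axis=-1) @ X

def mixer_token_mixing(X, W_tokens):
    # MLP-Mixer-style token mixing reduced to one linear layer (assumption).
    # W_tokens is fixed after training and shared across all inputs: every position
    # is combined with the same learned pattern, regardless of content.
    return W_tokens @ X

seq_len, dim = 6, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, dim))
W_tokens = rng.normal(size=(seq_len, seq_len)) / np.sqrt(seq_len)

print(attention_mixing(X).shape)            # (6, 8) -- mixing pattern varies with X
print(mixer_token_mixing(X, W_tokens).shape)  # (6, 8) -- mixing pattern is fixed
```

The input-dependent weights in the attention path versus the fixed token-mixing matrix in the Mixer path is the contrast the paper's expressivity analysis is about; the code only visualizes the mechanism, not the exponential-gap result itself.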