Paper Title

Efficient Language Modeling with Sparse all-MLP

Paper Authors

Ping Yu, Mikel Artetxe, Myle Ott, Sam Shleifer, Hongyu Gong, Ves Stoyanov, Xian Li

Paper Abstract

All-MLP architectures have attracted increasing interest as an alternative to attention-based models. In NLP, recent work like gMLP shows that all-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks. In this work, we analyze the limitations of MLPs in expressiveness, and propose sparsely activated MLPs with mixture-of-experts (MoEs) in both feature and input (token) dimensions. Such sparse all-MLPs significantly increase model capacity and expressiveness while keeping the compute constant. We address critical challenges in incorporating conditional computation with two routing strategies. The proposed sparse all-MLP improves language modeling perplexity and obtains up to 2$\times$ improvement in training efficiency compared to both Transformer-based MoEs (GShard, Switch Transformer, Base Layers and HASH Layers) as well as dense Transformers and all-MLPs. Finally, we evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
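
The abstract describes an all-MLP block that is sparsely activated with mixture-of-experts in both the feature and the input (token) dimensions. Below is a minimal PyTorch sketch of the feature-dimension side of that idea: a token-mixing linear step followed by an MoE feed-forward layer with top-1 routing, so each token activates only one expert and compute stays roughly constant as experts are added. This is an illustration based only on the abstract, not the authors' implementation; the names `SparseMLPBlock`, `num_experts`, `d_model`, `d_ffn`, and `seq_len` are assumptions.

```python
# Minimal sketch of a sparsely activated all-MLP block (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMLPBlock(nn.Module):
    def __init__(self, d_model=512, d_ffn=2048, seq_len=128, num_experts=4):
        super().__init__()
        # Token (spatial) mixing: a linear map across the sequence dimension.
        self.token_mix = nn.Linear(seq_len, seq_len)
        # Feature-dimension experts: each expert is a small 2-layer MLP.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
            for _ in range(num_experts)
        ])
        # Router scores each token and sends it to its top-1 expert.
        self.router = nn.Linear(d_model, num_experts)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        # Token mixing along the sequence axis (transpose so Linear acts on seq_len).
        x = x + self.token_mix(self.norm(x).transpose(1, 2)).transpose(1, 2)
        # Top-1 routing: each token is processed by exactly one expert.
        h = self.norm(x)
        gate = F.softmax(self.router(h), dim=-1)   # (batch, seq, num_experts)
        weight, idx = gate.max(dim=-1)             # top-1 gate value and expert index per token
        out = torch.zeros_like(h)
        for e, expert in enumerate(self.experts):
            mask = idx == e                        # tokens routed to expert e
            if mask.any():
                out[mask] = weight[mask].unsqueeze(-1) * expert(h[mask])
        return x + out


# Example: run a batch of 2 sequences of length 128 through the block.
block = SparseMLPBlock()
y = block(torch.randn(2, 128, 512))
print(y.shape)  # torch.Size([2, 128, 512])
```

The paper additionally routes in the token dimension and studies two routing strategies; the sketch above only shows the generic top-1 expert dispatch pattern to make the "constant compute, larger capacity" claim concrete.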
