Paper Title

Efficient Sparsely Activated Transformers

Paper Authors

Salar Latifi, Saurav Muralidharan, Michael Garland

Paper Abstract

Transformer-based neural networks have achieved state-of-the-art task performance in a number of machine learning domains, including natural language processing and computer vision. To further improve their accuracy, recent work has explored the integration of dynamic behavior into these networks in the form of mixture-of-experts (MoE) layers. In this paper, we explore the introduction of MoE layers to optimize a different metric: inference latency. We introduce a novel system named PLANER that takes an existing Transformer-based network and a user-defined latency target and produces an optimized, sparsely-activated version of the original network that tries to meet the latency target while maintaining baseline accuracy. We evaluate PLANER on two real-world language modeling tasks using the Transformer-XL network and achieve inference latency reductions of over 2x at iso-accuracy.
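To make the "sparsely activated" idea concrete, below is a minimal sketch of a top-1 routed mixture-of-experts feed-forward block in PyTorch: each token is sent to only one expert chosen by a learned router, so only a fraction of the layer's weights are activated per token. This is an illustrative example of the general MoE mechanism, not PLANER's actual architecture or training procedure; the class and parameter names (Top1MoEFeedForward, num_experts, d_ff) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoEFeedForward(nn.Module):
    """Illustrative top-1 routed mixture-of-experts feed-forward block.

    A learned router assigns each token to a single expert, so only one
    expert's weights are activated per token (sparse activation).
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for per-token routing
        tokens = x.reshape(-1, x.size(-1))
        gate_probs = F.softmax(self.router(tokens), dim=-1)   # (tokens, experts)
        top_prob, top_idx = gate_probs.max(dim=-1)            # top-1 expert per token

        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                # Only tokens routed to expert i pass through its weights,
                # scaled by the router's confidence for that expert.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```

With this kind of layer, per-token compute depends on the expert sizes rather than a single dense feed-forward block, which is the lever a latency-targeted system can tune when replacing dense layers in an existing Transformer.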
