Paper Title
Transformers from an Optimization Perspective
Paper Authors
Paper Abstract
Deep learning models such as the Transformer are often constructed by heuristics and experience. To provide a complementary foundation, in this work we study the following problem: Is it possible to find an energy function underlying the Transformer model, such that descent steps along this energy correspond with the Transformer forward pass? By finding such a function, we can view Transformers as the unfolding of an interpretable optimization process across iterations. This unfolding perspective has been frequently adopted in the past to elucidate more straightforward deep models such as MLPs and CNNs; however, obtaining a similar equivalence for more complex models with self-attention mechanisms like the Transformer has thus far remained elusive. To this end, we first outline several major obstacles before providing companion techniques to at least partially address them, demonstrating for the first time a close association between energy function minimization and deep layers with self-attention. This interpretation contributes to our intuition and understanding of Transformers, while potentially laying the groundwork for new model designs.
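As an illustration of the kind of energy/descent correspondence the abstract describes, below is a minimal, hypothetical NumPy sketch (our simplification, not the paper's construction). It uses a toy single-query energy E(q) = 0.5*||q||^2 - logsumexp(beta * X q); for this particular choice, one gradient-descent step with unit step size reduces exactly to a softmax-attention readout over the rows of X. The energy, the function names, and the single-query setting are all assumptions made purely for illustration.

import numpy as np

# Illustrative sketch (an assumption, not the paper's energy): for
#   E(q) = 0.5 * ||q||^2 - (1/beta) * logsumexp(beta * X q),
# a single gradient step with step size 1 equals a softmax-attention
# readout of the rows of X, i.e. q_new = X^T softmax(beta * X q).

def softmax(z):
    z = z - z.max()                         # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def energy(q, X, beta=1.0):
    scores = beta * (X @ q)                 # X: (n_tokens, d), q: (d,)
    lse = np.log(np.exp(scores - scores.max()).sum()) + scores.max()
    return 0.5 * q @ q - lse / beta

def grad_energy(q, X, beta=1.0):
    attn = softmax(beta * (X @ q))          # attention weights over tokens
    return q - X.T @ attn                   # gradient of the toy energy

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                 # toy "key/value" token matrix
q = rng.normal(size=3)                      # toy query state

q_descent = q - 1.0 * grad_energy(q, X)     # one unit-step descent update
q_attention = X.T @ softmax(X @ q)          # softmax-attention readout

print(np.allclose(q_descent, q_attention))        # True: the two coincide
print(energy(q_descent, X) <= energy(q, X))       # the step lowers this toy energy

In the paper's actual setting the relevant energy and update are considerably more involved (full token sequences, learned weight matrices, normalization), so this sketch should be read only as intuition for what a correspondence between energy minimization and a self-attention forward pass can look like.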