Paper Title

Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models

Paper Authors

Zirui Wang, Yulia Tsvetkov, Orhan Firat, Yuan Cao

Paper Abstract

Massively multilingual models subsuming tens or even hundreds of languages pose great challenges to multi-task optimization. While it is a common practice to apply a language-agnostic procedure optimizing a joint multilingual task objective, how to properly characterize and take advantage of its underlying problem structure for improving optimization efficiency remains under-explored. In this paper, we attempt to peek into the black-box of multilingual optimization through the lens of loss function geometry. We find that gradient similarity measured along the optimization trajectory is an important signal, which correlates well with not only language proximity but also the overall model performance. Such observation helps us to identify a critical limitation of existing gradient-based multi-task learning methods, and thus we derive a simple and scalable optimization procedure, named Gradient Vaccine, which encourages more geometrically aligned parameter updates for close tasks. Empirically, our method obtains significant model performance gains on multilingual machine translation and XTREME benchmark tasks for multilingual language models. Our work reveals the importance of properly measuring and utilizing language proximity in multilingual optimization, and has broader implications for multi-task learning beyond multilingual modeling.
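
To make "more geometrically aligned parameter updates" concrete, below is a minimal NumPy sketch of the kind of pairwise gradient adjustment the abstract describes: maintain a target cosine similarity for each task pair, and when two task gradients are less aligned than their target, nudge one toward the other until the target is met. The function name grad_vac_update, the EMA rate beta, and the epsilon guards are illustrative assumptions for this sketch, not the authors' reference implementation.

```python
import numpy as np

def grad_vac_update(g_i, g_j, phi_hat, beta=0.01, eps=1e-12):
    """Sketch of one Gradient-Vaccine-style pairwise adjustment.

    g_i, g_j : flattened gradients of two tasks (e.g. two language pairs)
    phi_hat  : running target cosine similarity for this task pair, in (-1, 1)
    beta     : EMA decay rate for the target (illustrative default)
    Returns the adjusted g_i and the updated target phi_hat.
    """
    # Observed cosine similarity between the two task gradients.
    norm_i, norm_j = np.linalg.norm(g_i), np.linalg.norm(g_j)
    phi = g_i @ g_j / (norm_i * norm_j + eps)

    # If the gradients are less aligned than the target, add just enough
    # of g_j to g_i that their cosine similarity is raised to phi_hat.
    if phi < phi_hat:
        coef = (norm_i * (phi_hat * np.sqrt(1.0 - phi**2)
                          - phi * np.sqrt(1.0 - phi_hat**2))
                / (norm_j * np.sqrt(1.0 - phi_hat**2) + eps))
        g_i = g_i + coef * g_j

    # Track how similar this task pair tends to be along the trajectory,
    # so closely related tasks accumulate higher alignment targets.
    phi_hat = (1.0 - beta) * phi_hat + beta * phi
    return g_i, phi_hat
```

Setting the target to zero reduces this rule to a PCGrad-style projection that only de-conflicts negatively interfering gradients; the trajectory-dependent, per-pair positive target is what lets closely related languages pull each other's updates into alignment, as the abstract argues.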
