Paper Title
HyperGrid: Efficient Multi-Task Transformers with Grid-wise Decomposable Hyper Projections
Paper Authors
Paper Abstract
Achieving state-of-the-art performance on natural language understanding tasks typically relies on fine-tuning a fresh model for every task. Consequently, this approach leads to a higher overall parameter cost, along with higher technical maintenance for serving multiple models. Learning a single multi-task model that is able to do well for all the tasks has been a challenging and yet attractive proposition. In this paper, we propose \textsc{HyperGrid}, a new approach for highly effective multi-task learning. The proposed approach is based on a decomposable hypernetwork that learns grid-wise projections that help to specialize regions in weight matrices for different tasks. In order to construct the proposed hypernetwork, our method learns the interactions and composition between a global (task-agnostic) state and a local task-specific state. We apply our proposed \textsc{HyperGrid} on the current state-of-the-art T5 model, demonstrating strong performance across the GLUE and SuperGLUE benchmarks when using only a single multi-task model. Our method helps bridge the gap between fine-tuning and multi-task learning approaches.
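To make the grid-wise idea more concrete, below is a minimal, hedged sketch (not the authors' implementation) of how a hypernetwork-style gate could compose a global (task-agnostic) vector with a local task-specific vector into a coarse grid, and then expand that grid block-wise over a feed-forward weight matrix so that different regions of the matrix specialize to different tasks. All names here (GridGatedLinear, rows, cols, global_row_gates, task_col_gates) are illustrative assumptions rather than the paper's actual modules.

```python
import torch
import torch.nn as nn


class GridGatedLinear(nn.Module):
    """Linear layer whose weight matrix is modulated by a task-conditioned gating grid."""

    def __init__(self, d_in, d_out, num_tasks, rows=8, cols=8):
        super().__init__()
        assert d_out % rows == 0 and d_in % cols == 0
        self.rows, self.cols = rows, cols
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(d_out))
        # Global (task-agnostic) gates over row blocks; task-specific gates over column blocks.
        self.global_row_gates = nn.Parameter(torch.zeros(rows))
        self.task_col_gates = nn.Embedding(num_tasks, cols)

    def forward(self, x, task_id):
        # Compose global and local (task-specific) states into a coarse rows x cols grid.
        r = torch.sigmoid(self.global_row_gates)          # (rows,)
        c = torch.sigmoid(self.task_col_gates(task_id))   # (cols,)
        grid = torch.outer(r, c)                          # (rows, cols)
        # Expand the grid block-wise to the full weight shape and gate the weights,
        # so each task activates a different pattern of regions in the same matrix.
        gate = grid.repeat_interleave(self.weight.size(0) // self.rows, dim=0)
        gate = gate.repeat_interleave(self.weight.size(1) // self.cols, dim=1)
        return x @ (self.weight * gate).t() + self.bias


# Usage: the same shared layer serves two tasks with different weight regions emphasized.
layer = GridGatedLinear(d_in=512, d_out=2048, num_tasks=4)
x = torch.randn(3, 512)
y_task0 = layer(x, torch.tensor(0))
y_task1 = layer(x, torch.tensor(1))
```

In this sketch the grid is the outer product of a shared gate vector and a per-task gate vector, which keeps the added parameter count small (rows + num_tasks * cols) while still letting each task carve out its own specialized regions of the shared weight matrix.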