Paper Title

DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering

Paper Authors

Qingqing Cao, Harsh Trivedi, Aruna Balasubramanian, Niranjan Balasubramanian

Paper Abstract

Transformer-based QA models use input-wide self-attention -- i.e. across both the question and the input passage -- at all layers, causing them to be slow and memory-intensive. It turns out that we can get by without input-wide self-attention at all layers, especially in the lower layers. We introduce DeFormer, a decomposed transformer, which substitutes the full self-attention with question-wide and passage-wide self-attentions in the lower layers. This allows for question-independent processing of the input text representations, which in turn enables pre-computing passage representations reducing runtime compute drastically. Furthermore, because DeFormer is largely similar to the original model, we can initialize DeFormer with the pre-training weights of a standard transformer, and directly fine-tune on the target QA dataset. We show DeFormer versions of BERT and XLNet can be used to speed up QA by over 4.3x and with simple distillation-based losses they incur only a 1% drop in accuracy. We open source the code at https://github.com/StonyBrookNLP/deformer.
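The abstract describes the decomposition only at a high level. Below is a minimal PyTorch sketch of the idea, not the authors' implementation: the class name `DeFormerSketch`, the 9/3 lower/upper layer split, and the `cached_passage` argument are illustrative assumptions, and the sketch omits the pre-trained BERT/XLNet weight initialization and the distillation-based losses mentioned in the abstract.

```python
import torch
import torch.nn as nn

class DeFormerSketch(nn.Module):
    """Illustrative decomposed encoder: lower layers encode the question and the
    passage independently; only the upper layers attend across both segments."""

    def __init__(self, d_model=768, n_heads=12, n_lower=9, n_upper=3):
        super().__init__()

        def _layer():
            return nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

        self.lower_layers = nn.ModuleList([_layer() for _ in range(n_lower)])
        self.upper_layers = nn.ModuleList([_layer() for _ in range(n_upper)])

    def encode_lower(self, x):
        # Self-attention restricted to a single segment (question OR passage),
        # so passage representations can be pre-computed offline and cached.
        for layer in self.lower_layers:
            x = layer(x)
        return x

    def forward(self, question_emb, passage_emb=None, cached_passage=None):
        q = self.encode_lower(question_emb)
        p = cached_passage if cached_passage is not None else self.encode_lower(passage_emb)
        # Upper layers see the concatenated sequence, i.e. full self-attention
        # across question and passage, as in the original transformer.
        x = torch.cat([q, p], dim=1)
        for layer in self.upper_layers:
            x = layer(x)
        return x

# Hypothetical usage: pre-compute passage representations once, reuse per question.
enc = DeFormerSketch()
passage = torch.randn(1, 300, 768)   # already-embedded passage tokens
question = torch.randn(1, 20, 768)   # already-embedded question tokens
cached = enc.encode_lower(passage)   # done offline, stored alongside the passage
out = enc(question, cached_passage=cached)
```

In this sketch, the output of `encode_lower` for each passage can be computed ahead of time and cached, so only the short question passes through the lower layers at query time; that pre-computation is what enables the runtime savings the abstract reports.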
