Paper Title

A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining

Paper Authors

Hongwu Peng, Shaoyi Huang, Shiyang Chen, Bingbing Li, Tong Geng, Ang Li, Weiwen Jiang, Wujie Wen, Jinbo Bi, Hang Liu, Caiwen Ding

Paper Abstract

Transformers have been considered among the most important deep learning models since 2018, in part because they establish state-of-the-art (SOTA) records and could potentially replace existing Deep Neural Networks (DNNs). Despite these remarkable triumphs, the prolonged turnaround time of Transformer models is a widely recognized roadblock. The variety of sequence lengths imposes additional computing overhead, since inputs must be zero-padded to the maximum sentence length in the batch to accommodate parallel computing platforms. This paper targets the field-programmable gate array (FPGA) and proposes a coherent sequence-length-adaptive algorithm-hardware co-design for Transformer acceleration. In particular, we develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm. The proposed sparse attention operator brings the complexity of attention-based models down to linear and alleviates off-chip memory traffic. The proposed length-aware hardware resource scheduling algorithm dynamically allocates hardware resources to fill the pipeline slots and eliminate bubbles in NLP tasks. Experiments show that our design incurs very small accuracy loss, achieves 80.2$\times$ and 2.6$\times$ speedup over CPU and GPU implementations, respectively, and is 4$\times$ more energy-efficient than a state-of-the-art GPU accelerator optimized via CUBLAS GEMM.
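To make the padding overhead concrete: when a batch mixes short and long sequences, every sequence is zero-padded to the batch maximum, and the padded positions still consume compute on a parallel platform. The sketch below is illustrative only (`padding_overhead` is a hypothetical helper, not from the paper) and quantifies the wasted fraction:

```python
# Hypothetical illustration of zero-padding overhead (not from the paper).
# Batching pads every sequence to the longest one in the batch, so the
# padded token slots are pure wasted work.

def padding_overhead(seq_lens):
    """Fraction of token slots that are zero-padding after batching."""
    max_len = max(seq_lens)
    total_slots = max_len * len(seq_lens)   # what the hardware processes
    useful_slots = sum(seq_lens)            # what actually carries tokens
    return 1.0 - useful_slots / total_slots

# A batch with lengths 12, 40, and 128 wastes more than half its slots:
print(padding_overhead([12, 40, 128]))  # -> 0.53125
```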
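The abstract does not spell out the sparsity pattern of the proposed attention operator. As a minimal sketch of how a fixed-window (local) attention pattern reaches the linear complexity claimed above, assuming a NumPy reference implementation rather than the paper's FPGA operator:

```python
import numpy as np

def windowed_attention(q, k, v, window=8):
    """Local attention sketch: each query attends only to keys within a
    fixed window around its position, so the cost is O(n * window)
    instead of the O(n^2) of dense attention."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)    # scaled dot-product
        weights = np.exp(scores - scores.max())    # stable softmax
        weights /= weights.sum()
        out[i] = weights @ v[lo:hi]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 64))
out = windowed_attention(x, x, x)  # self-attention over 128 tokens
```

Because each output row touches at most `2 * window + 1` keys, only a bounded slice of the key/value matrices is needed at a time, which is also why such patterns reduce off-chip memory traffic on an accelerator.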
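The length-aware scheduler itself allocates FPGA resources and cannot be reproduced in software here, but the underlying idea, grouping work by sequence length so pipeline slots are not wasted on padding, can be sketched. All names below (`bucket_by_length`, `bucket_size`) are hypothetical:

```python
from collections import defaultdict

def bucket_by_length(seq_lens, bucket_size=32):
    """Group sequence indices into buckets of similar length, so each
    batch is padded only to its bucket's maximum rather than the
    global maximum sentence length."""
    buckets = defaultdict(list)
    for idx, n in enumerate(seq_lens):
        # Round the length up to the nearest multiple of bucket_size.
        buckets[-(-n // bucket_size) * bucket_size].append(idx)
    return dict(buckets)

print(bucket_by_length([12, 40, 128, 30, 100]))
# -> {32: [0, 3], 64: [1], 128: [2, 4]}
```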
