Paper Title

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Paper Authors

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

Paper Abstract

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K), and 2.4$\times$ speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
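
As a rough illustration of the idea described in the abstract, the sketch below computes exact attention block by block while maintaining running softmax statistics, so the full N×N score matrix is never materialized at once. This is a minimal NumPy sketch under simplifying assumptions (single head, tiling only over K/V blocks); the function name `tiled_attention` and the block size are illustrative, not the paper's fused CUDA kernel or its exact schedule.

```python
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Exact attention computed over K/V blocks with an online softmax.

    Running statistics (row-wise max m and normalizer l) are rescaled as
    each block is processed, so the N x N score matrix is never formed
    all at once. Simplified single-head sketch, not the paper's kernel.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)

    O = np.zeros((N, d))              # running (unnormalized) output
    m = np.full(N, -np.inf)           # running row-wise max of scores
    l = np.zeros(N)                   # running softmax normalizer

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]       # one K tile
        Vb = V[start:start + block_size]       # matching V tile

        S = (Q @ Kb.T) * scale                 # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))   # updated row max
        P = np.exp(S - m_new[:, None])         # tile softmax numerator
        alpha = np.exp(m - m_new)              # rescales previous partial sums

        l = l * alpha + P.sum(axis=1)
        O = O * alpha[:, None] + P @ Vb
        m = m_new

    return O / l[:, None]                      # final normalization

# Sanity check against the naive quadratic-memory implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref)
```

The actual algorithm additionally tiles over blocks of Q and fuses the whole computation into one GPU kernel so that each tile stays in on-chip SRAM, which is what reduces the reads and writes to HBM that the abstract emphasizes.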
