Paper Title
ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention
Paper Authors
Paper Abstract
Vision Transformer (ViT) has emerged as a competitive alternative to convolutional neural networks for various computer vision applications. Specifically, ViT multi-head attention layers make it possible to embed information globally across the overall image. Nevertheless, computing and storing such attention matrices incurs a quadratic cost in the number of patches, limiting the achievable efficiency and scalability and prohibiting more extensive real-world ViT applications on resource-constrained devices. Sparse attention has been shown to be a promising direction for improving hardware acceleration efficiency for NLP models. However, a systematic counterpart approach is still missing for accelerating ViT models. To close the above gap, we propose a first-of-its-kind algorithm-hardware co-designed framework, dubbed ViTALiTy, for boosting the inference efficiency of ViTs. Unlike sparsity-based Transformer accelerators for NLP, ViTALiTy unifies both low-rank and sparse components of the attention in ViTs. At the algorithm level, we approximate the dot-product softmax operation via first-order Taylor attention with row-mean centering as the low-rank component to linearize the cost of attention blocks, and further boost the accuracy by incorporating a sparsity-based regularization. At the hardware level, we develop a dedicated accelerator to better leverage the resulting workload and pipeline from ViTALiTy's linear Taylor attention, which requires executing only the low-rank component, to further boost the hardware efficiency. Extensive experiments and ablation studies validate that ViTALiTy offers boosted end-to-end efficiency (e.g., $3\times$ faster and $3\times$ more energy-efficient) under comparable accuracy, with respect to the state-of-the-art solution.
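As a rough illustration of the linearization idea described in the abstract, the sketch below approximates exp(q·k) by its first-order Taylor expansion 1 + q·k and reassociates the matrix products so that the n×n attention matrix is never materialized, reducing the cost from O(n²d) to O(nd²). The placement of the row-mean centering and the exact scaling here are assumptions for illustration and may differ from ViTALiTy's actual formulation, which also adds a sparse residual component during training.

```python
import numpy as np

def taylor_linear_attention(Q, K, V):
    """Minimal sketch of first-order Taylor (linear) attention.

    The softmax numerator exp(QK^T)V is approximated by (1 + QK^T)V and
    the row-wise normalizer by n + Q(K^T 1). Computing K^T V (a d x d
    matrix) first avoids forming the n x n attention matrix, so the cost
    grows linearly with the number of patches n.
    """
    n, d = Q.shape
    Q = Q / np.sqrt(d)                       # standard dot-product scaling
    Q = Q - Q.mean(axis=1, keepdims=True)    # row-mean centering (placement is an assumption)

    kv = K.T @ V                             # (d, d): independent of n
    k_sum = K.sum(axis=0)                    # (d,): K^T 1
    numerator = V.sum(axis=0) + Q @ kv       # "1" term broadcast over rows, plus Taylor term
    denominator = n + Q @ k_sum              # (n,): row-wise normalizer
    return numerator / denominator[:, None]

# Example usage on random queries/keys/values for a toy sequence.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((196, 64)) for _ in range(3))
out = taylor_linear_attention(Q, K, V)
print(out.shape)  # (196, 64)
```

Because the d×d product K^T V can be computed once and reused for every query row, this reassociation is what lets the dedicated accelerator execute only the low-rank component at inference time.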