Paper Title


Delta Keyword Transformer: Bringing Transformers to the Edge through Dynamically Pruned Multi-Head Self-Attention

Paper Authors

Zuzana Jelčicová, Marian Verhelst

Paper Abstract


Multi-head self-attention forms the core of Transformer networks. However, their quadratically growing complexity with respect to the input sequence length impedes their deployment on resource-constrained edge devices. We address this challenge by proposing a dynamic pruning method, which exploits the temporal stability of data across tokens to reduce inference cost. The threshold-based method only retains significant differences between the subsequent tokens, effectively reducing the number of multiply-accumulates, as well as the internal tensor data sizes. The approach is evaluated on the Google Speech Commands Dataset for keyword spotting, and the performance is compared against the baseline Keyword Transformer. Our experiments show that we can reduce ~80% of operations while maintaining the original 98.4% accuracy. Moreover, a reduction of ~87-94% operations can be achieved when only degrading the accuracy by 1-4%, speeding up the multi-head self-attention inference by a factor of ~7.5-16.
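To make the threshold-based delta idea concrete, here is a minimal NumPy sketch of the general principle: for each new token, only the input elements whose change since the stored reference exceeds a threshold trigger multiply-accumulates, while the rest reuse a cached result. The function and variable names (`delta_threshold_matvec`, `x_ref`, `y_ref`, `threshold`) are illustrative assumptions, not code from the paper, and the sketch shows a single linear projection rather than the full multi-head self-attention pipeline the authors prune.

```python
import numpy as np

def delta_threshold_matvec(W, x_t, x_ref, y_ref, threshold):
    """Delta-based linear projection for one token (illustrative sketch).

    Rather than recomputing y_t = W @ x_t from scratch, only the input
    elements whose change since the stored reference exceeds `threshold`
    trigger multiply-accumulates; the rest reuse the cached output y_ref.
    """
    delta = x_t - x_ref
    active = np.abs(delta) >= threshold          # keep only significant differences
    y_t = y_ref + W[:, active] @ delta[active]   # sparse update of the cached output
    # Advance the reference only where an update was applied, so small
    # drifts accumulate until they eventually cross the threshold.
    x_ref = x_ref.copy()
    x_ref[active] = x_t[active]
    return y_t, x_ref, active.mean()             # last value = fraction of MACs performed


# Toy usage: a stream of slowly varying token vectors (temporally stable input).
rng = np.random.default_rng(0)
d_in, d_out, n_tokens = 64, 64, 10
W = rng.standard_normal((d_out, d_in))
x_ref, y_ref, x = np.zeros(d_in), np.zeros(d_out), np.zeros(d_in)
for _ in range(n_tokens):
    x = x + 0.02 * rng.standard_normal(d_in)
    y_ref, x_ref, kept = delta_threshold_matvec(W, x, x_ref, y_ref, threshold=0.05)
    print(f"fraction of input columns computed: {kept:.2f}")
```

The update preserves the invariant y_ref == W @ x_ref exactly, so any accuracy loss comes only from thresholding away small deltas; raising the threshold skips more multiply-accumulates at the cost of accuracy, mirroring the operations-versus-accuracy trade-off reported in the abstract.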
