Paper Title


Similarity and Content-based Phonetic Self Attention for Speech Recognition

Authors

Kyuhong Shim, Wonyong Sung

Abstract


Transformer-based speech recognition models have achieved great success due to the self-attention (SA) mechanism that utilizes every frame in the feature extraction process. In particular, SA heads in lower layers capture various phonetic characteristics through the query-key dot product, which is designed to compute the pairwise relationship between frames. In this paper, we propose a variant of SA to extract more representative phonetic features. The proposed phonetic self-attention (phSA) is composed of two different types of phonetic attention: one is similarity-based and the other is content-based. In short, similarity-based attention captures the correlation between frames, while content-based attention considers each frame on its own, without being affected by other frames. We identify which parts of the original dot-product equation are related to the two different attention patterns and improve each part with simple modifications. Our experiments on phoneme classification and speech recognition show that replacing SA with phSA in lower layers improves recognition performance without increasing latency or parameter size.
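The decomposition described in the abstract can be illustrated with a small numerical sketch. Here the query-key logits of standard dot-product attention are split into a pairwise term (depending on both the query and key frames) and a key-only term (depending on each key frame's own content). This is only an illustration of the idea, not the paper's exact phSA formulation; the names `W_q`, `W_k`, and `b_q` denote ordinary attention projection weights and a query bias.

```python
import numpy as np

# Illustration (not the paper's exact method): with a query bias b_q,
# the attention logits Q K^T split into a similarity term that compares
# two frames and a content term that scores each key frame alone.

rng = np.random.default_rng(0)
T, d = 6, 8                        # number of frames, hidden size
X = rng.standard_normal((T, d))    # frame features
W_q = rng.standard_normal((d, d))  # query projection
W_k = rng.standard_normal((d, d))  # key projection
b_q = rng.standard_normal(d)       # query bias

Q = X @ W_q + b_q                  # queries (with bias)
K = X @ W_k                        # keys

logits = Q @ K.T / np.sqrt(d)      # standard scaled dot-product logits

# Decomposition:
#   (X W_q) K^T  -> similarity-based: depends on both frames i and j
#   b_q K^T      -> content-based: depends on key frame j only
similarity = (X @ W_q) @ K.T / np.sqrt(d)
content = (b_q @ K.T) / np.sqrt(d)

# The two terms reconstruct the original logits exactly.
assert np.allclose(logits, similarity + content[None, :])
```

The content term is constant along each column, i.e. it re-weights frames by their own features regardless of the query, which matches the abstract's description of content-based attention.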
