A practical framework for multi-domain speech recognition and an instance sampling method to neural language modeling

Paper title

A practical framework for multi-domain speech recognition and an instance sampling method to neural language modeling

Authors

Yike Zhang, Xiaobing Feng, Yi Liu, Songjun Cao, Long Ma

Abstract

Automatic speech recognition (ASR) systems used on smartphones or in vehicles are usually required to process speech queries from very different domains. In such situations, a vanilla ASR system usually fails to perform well on every domain. This paper proposes a multi-domain ASR framework for Tencent Map, a navigation app used on smartphones and in-vehicle infotainment systems. The proposed framework consists of three core parts: a basic ASR module that generates n-best lists for a speech query, a text classification module that determines which domain the speech query belongs to, and a reranking module that rescores the n-best lists using domain-specific language models. In addition, an instance-sampling-based method for training neural network language models (NNLMs) is proposed to address the data imbalance problem in multi-domain ASR. In experiments, the proposed framework was evaluated on the navigation and music domains, since navigating and playing music are the two main features of Tencent Map. Compared to a general ASR system, the proposed framework achieves a relative character error rate reduction of 13% to 22% on several test sets collected from Tencent Map and our in-car voice assistant.
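The three-part flow described in the abstract (n-best generation, domain classification, domain-specific rescoring) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the keyword-overlap classifier, the add-one smoothed unigram models standing in for the domain-specific NNLMs, the choice to classify the top hypothesis, the `lm_weight` interpolation, and all function names and toy data.

```python
import math
from collections import Counter

def build_lm(sentences):
    """Toy unigram model: (token counts, total tokens, vocabulary size)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values()), len(counts)

def lm_logprob(text, lm):
    """Add-one smoothed unigram log-probability (stand-in for a domain NNLM)."""
    counts, total, vocab_size = lm
    return sum(math.log((counts[tok] + 1) / (total + vocab_size))
               for tok in text.split())

def classify_domain(text, keywords):
    """Hypothetical classifier: the domain whose keywords overlap the text most."""
    tokens = set(text.split())
    return max(keywords, key=lambda d: len(tokens & keywords[d]))

def rerank(nbest, domain_lms, keywords, lm_weight=1.0):
    """Classify the top hypothesis, then rescore the n-best list with that domain's LM."""
    top_text = max(nbest, key=lambda h: h[1])[0]
    domain = classify_domain(top_text, keywords)
    lm = domain_lms[domain]
    rescored = [(text, score + lm_weight * lm_logprob(text, lm))
                for text, score in nbest]
    rescored.sort(key=lambda h: h[1], reverse=True)
    return domain, rescored

# Hypothetical toy data: tiny per-domain corpora and keyword sets.
domain_lms = {
    "music": build_lm(["play some jazz music", "play the next song"]),
    "navigation": build_lm(["navigate to the airport",
                            "find the nearest gas station"]),
}
keywords = {
    "music": {"play", "song", "music"},
    "navigation": {"navigate", "route", "station"},
}

# Hypothetical n-best list from the base ASR module: (text, log-score).
nbest = [
    ("play the next son", -4.0),
    ("play the next song", -4.2),
    ("pay the next song", -4.5),
]

domain, reranked = rerank(nbest, domain_lms, keywords)
print(domain, reranked[0][0])
```

On this toy input the base system ranks "play the next son" first, while rescoring with the music-domain model moves "play the next song" to the top, which is the kind of correction a domain-specific reranker is meant to provide.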
