Cross-lingual and Multilingual Spoken Term Detection for Low-Resource Indian Languages

Paper Title

Cross-lingual and Multilingual Spoken Term Detection for Low-Resource Indian Languages

Authors

Sanket Shah, Satarupa Guha, Simran Khanuja, Sunayana Sitaram

Abstract

Spoken Term Detection (STD) is the task of searching for words or phrases within audio, given either text or spoken input as a query. In this work, we use state-of-the-art Hindi, Tamil and Telugu ASR systems cross-lingually for lexical Spoken Term Detection in ten low-resource Indian languages. Since no publicly available dataset exists for Spoken Term Detection in these languages, we create a new dataset using a publicly available TTS dataset. We report a standard metric for STD, Mean Term Weighted Value (MTWV), and show that ASR systems built in languages that are phonetically similar to the target languages have higher accuracy; however, it is also possible to get high MTWV scores for dissimilar languages by using a relaxed phone matching algorithm. We propose a technique to bootstrap the Grapheme-to-Phoneme (G2P) mapping between all the languages under consideration using publicly available resources. Gains are obtained when we combine the output of multiple ASR systems and when we use language-specific Language Models. We show that it is possible to perform STD cross-lingually in a zero-shot manner without the need for any language-specific speech data. We plan to make the STD dataset available for other researchers interested in cross-lingual STD.
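
For readers unfamiliar with the MTWV metric named in the abstract, the following is a minimal sketch of how it is commonly computed under the NIST STD 2006 definition. It is not taken from the paper: the weight `BETA = 999.9`, the one-trial-per-second approximation for non-target trials, and the names `TermCounts`, `term_weighted_value` and `mtwv` are assumptions made for illustration, and the authors' scoring setup may differ.

```python
from dataclasses import dataclass

# Assumed default from the NIST STD 2006 evaluation; the paper may use another value.
BETA = 999.9


@dataclass
class TermCounts:
    """Per-term detection counts at a single decision threshold (hypothetical helper)."""
    n_true: int      # reference occurrences of the term in the audio
    n_correct: int   # detections that match a reference occurrence
    n_spurious: int  # detections that match nothing (false alarms)


def term_weighted_value(counts, speech_seconds, beta=BETA):
    """TWV at one threshold: 1 minus the mean over terms of (P_miss + beta * P_fa)."""
    losses = []
    for c in counts:
        if c.n_true == 0:
            continue  # terms with no reference occurrence are skipped
        p_miss = 1.0 - c.n_correct / c.n_true
        # Non-target trials are approximated as one per second of speech.
        p_fa = c.n_spurious / (speech_seconds - c.n_true)
        losses.append(p_miss + beta * p_fa)
    if not losses:
        raise ValueError("no term has a reference occurrence")
    return 1.0 - sum(losses) / len(losses)


def mtwv(counts_by_threshold, speech_seconds, beta=BETA):
    """Return (best_threshold, best_twv): the maximum TWV over all thresholds."""
    scored = (
        (threshold, term_weighted_value(counts, speech_seconds, beta))
        for threshold, counts in counts_by_threshold.items()
    )
    return max(scored, key=lambda pair: pair[1])
```

Under this definition a system that finds every occurrence with no false alarms scores 1.0, a system that outputs nothing scores 0.0, and false alarms are penalised heavily enough that the score can go negative.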
