Paper Title
DUAL: Discrete Spoken Unit Adaptive Learning for Textless Spoken Question Answering
Paper Authors
Paper Abstract
Spoken Question Answering (SQA) aims to find the answer in a spoken document given a question, which is crucial for personal assistants replying to user queries. Existing SQA methods all rely on Automatic Speech Recognition (ASR) transcripts. Not only does ASR need to be trained on massive annotated data that are time- and cost-prohibitive to collect for low-resource languages, but, more importantly, the answers to the questions very often include named entities or out-of-vocabulary words that cannot be recognized correctly. Moreover, ASR aims to minimize recognition errors equally over all words, including many function words irrelevant to the SQA task. Therefore, SQA without ASR transcripts (textless SQA) has always been highly desirable, although it is known to be very difficult. This work proposes Discrete Spoken Unit Adaptive Learning (DUAL), which leverages unlabeled data for pre-training and is fine-tuned on the SQA downstream task. The time intervals of spoken answers can be directly predicted from spoken documents. We also release a new SQA benchmark corpus, NMSQA, for data with more realistic scenarios. We empirically show that DUAL yields results comparable to those obtained by cascading ASR and a text QA model, and that it is robust to real-world data. Our code and model will be open-sourced.
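To make the textless setup concrete, below is a minimal sketch (not the authors' released code) of how extractive span prediction over discrete spoken units could look. It assumes the units come from clustering self-supervised speech features (e.g., HuBERT features followed by k-means, as is common in textless speech processing); the class name, model dimensions, unit vocabulary size, and frame duration are all illustrative assumptions, not details from the abstract.

```python
# Illustrative sketch of textless SQA as span prediction over discrete
# speech units. All hyperparameters and names here are assumptions.
import torch
import torch.nn as nn

class UnitSpanQA(nn.Module):
    def __init__(self, num_units=128, dim=256, layers=4, heads=4):
        super().__init__()
        self.embed = nn.Embedding(num_units, dim)   # discrete unit ID -> vector
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.span_head = nn.Linear(dim, 2)          # start/end logits per unit

    def forward(self, question_units, document_units):
        # Concatenate question and document unit sequences, analogous to
        # extractive text QA over token sequences. (In practice, logits over
        # question positions would be masked out.)
        x = torch.cat([question_units, document_units], dim=1)
        h = self.encoder(self.embed(x))
        start_logits, end_logits = self.span_head(h).unbind(-1)
        return start_logits, end_logits

model = UnitSpanQA()
q = torch.randint(0, 128, (1, 50))    # toy question unit IDs
d = torch.randint(0, 128, (1, 400))   # toy document unit IDs
start, end = model(q, d)
# Since each unit covers a fixed frame stride (assumed 20 ms here), the
# predicted start/end unit indices map directly back to a time interval
# in the spoken document, with no text transcript involved.
```

The key point the sketch illustrates is that once speech is quantized into discrete units, the answer span can be predicted over the unit sequence itself, so the output is a time interval rather than a text string.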