Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages

Paper Title

Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages

Authors

Prakash, Anusha; Kumar, Arun; Seth, Ashish; Mukherjee, Bhagyashree; Gupta, Ishika; Kuriakose, Jom; Fernandes, Jordan; Vikram, K V; M, Mano Ranjith Kumar; Mary, Metilda Sagaya; Wajahat, Mohammad; N, Mohana; Batra, Mudit; K, Navina; George, Nihal John; Ravi, Nithya; Mishra, Pruthwik; Srivastava, Sudhanshu; Lodagala, Vasista Sai; Mujadia, Vandan; Vineeth, Kada Sai Venkata; Sukhadia, Vrunda; Sharma, Dipti; Murthy, Hema; Bhattacharya, Pushpak; Umesh, S; Sangal, Rajeev

Abstract

Cross-lingual dubbing of lecture videos requires the transcription of the original audio, correction and removal of disfluencies, domain term discovery, text-to-text translation into the target language, chunking of text using target language rhythm, text-to-speech synthesis followed by isochronous lipsyncing to the original video. This task becomes challenging when the source and target languages belong to different language families, resulting in differences in generated audio duration. This is further compounded by the original speaker's rhythm, especially for extempore speech. This paper describes the challenges in regenerating English lecture videos in Indian languages semi-automatically. A prototype is developed for dubbing lectures into 9 Indian languages. A mean-opinion-score (MOS) is obtained for two languages, Hindi and Tamil, on two different courses. The output video is compared with the original video in terms of MOS (1-5) and lip synchronisation with scores of 4.09 and 3.74, respectively. The human effort also reduces by 75%.
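The duration mismatch mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the `Segment` type, the `tempo_factor` helper, and the clamping range of 0.8 to 1.25 are all hypothetical, chosen only to show why synthesized target-language audio may not fit the time slot of the original speech.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float         # start of the original speech segment, in seconds
    end: float           # end of the original speech segment, in seconds
    tts_duration: float  # duration of the synthesized target-language audio, in seconds


def tempo_factor(seg: Segment, min_rate: float = 0.8, max_rate: float = 1.25):
    """Return (rate, residual) for fitting synthesized audio into the source slot.

    rate     > 1 means the synthesized audio is sped up, < 1 means slowed down.
    residual is the leftover mismatch in seconds after applying the clamped rate:
             positive means the audio still overruns the slot, negative means a gap.
    The clamp bounds are assumed values for illustration only.
    """
    slot = seg.end - seg.start
    if slot <= 0 or seg.tts_duration <= 0:
        raise ValueError("segment slot and synthesized duration must be positive")
    ideal = seg.tts_duration / slot              # rate that would fit exactly
    rate = min(max(ideal, min_rate), max_rate)   # keep the speech rate within bounds
    residual = seg.tts_duration / rate - slot    # mismatch that tempo change cannot absorb
    return rate, residual
```

A non-zero residual marks a segment where a tempo change alone is not enough. In a pipeline like the one described, such segments would presumably need re-chunking or human review, although the abstract does not say how they are handled.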
