J-MAC: Japanese multi-speaker audiobook corpus for speech synthesis

Paper title

J-MAC: Japanese multi-speaker audiobook corpus for speech synthesis

Authors

Shinnosuke Takamichi, Wataru Nakata, Naoko Tanji, Hiroshi Saruwatari

Abstract

In this paper, we construct a Japanese audiobook speech corpus called "J-MAC" for speech synthesis research. With the success of reading-style speech synthesis, the research target is shifting to tasks that use complicated contexts. Audiobook speech synthesis is a good example: it requires cross-sentence context, expressiveness, and so on. Unlike reading-style speech, speaker-specific expressiveness in audiobook speech also becomes part of the context. To advance this research, we propose a method of constructing a corpus from audiobooks read by professional speakers. From many audiobooks and their texts, our method can automatically extract and refine the data without any language dependency. Specifically, we use vocal-instrumental separation to extract clean data, connectionist temporal classification to roughly align text and audio, and voice activity detection to refine the alignment. J-MAC is open-sourced on our project page. We also conduct audiobook speech synthesis evaluations, and the results give insights into audiobook speech synthesis.
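
The last step of the pipeline described in the abstract, refining a rough alignment with voice activity detection, can be illustrated with a minimal sketch. The abstract does not say which detector is used, so the code below swaps in a simple frame-energy threshold as a stand-in. The function names (`frame_energies_db`, `refine_segment`), the frame and hop sizes, and the -40 dB threshold are all assumptions made for illustration and are not taken from the paper.

```python
import math


def frame_energies_db(samples, frame_len, hop):
    """Log energy (dB) of each full frame in a mono signal."""
    energies = []
    for begin in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[begin:begin + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small constant avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def refine_segment(samples, start, end, frame_len=400, hop=160,
                   threshold_db=-40.0):
    """Trim a rough [start, end) sample range to the frames above threshold.

    Hypothetical helper: a crude energy-based stand-in for a real
    voice activity detector.
    """
    energies = frame_energies_db(samples[start:end], frame_len, hop)
    active = [i for i, e in enumerate(energies) if e > threshold_db]
    if not active:
        return start, end  # nothing detected; keep the rough boundaries
    new_start = start + active[0] * hop
    new_end = min(end, start + active[-1] * hop + frame_len)
    return new_start, new_end


# Synthetic example: 1 s silence, 1 s tone, 1 s silence at 16 kHz.
rate = 16000
silence = [0.0] * rate
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
audio = silence + tone + silence

# The rough segment spans the whole clip; refinement trims the silence.
print(refine_segment(audio, 0, len(audio)))
```

With a rough segment that covers the whole clip, the refined boundaries land within one frame of the tone's true start and end. A production pipeline would more likely use a trained detector, since an energy threshold is easily fooled by background music or noise, which is presumably why the separation step comes first.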
