Paper Title
How Similar or Different Is Rakugo Speech Synthesizer to Professional Performers?
Authors
Abstract
We have been working on speech synthesis for rakugo (a traditional Japanese form of verbal entertainment similar to one-person stand-up comedy), with the goal of speech synthesis that authentically entertains audiences. In this paper, we propose a novel evaluation methodology using synthesized rakugo speech and real rakugo speech uttered by professional performers of three different ranks. The naturalness of the synthesized speech was comparable to that of the human speech, but the synthesized speech entertained listeners less than the performers of any rank. Nevertheless, we obtained some interesting insights into the challenges to be solved in order to achieve a truly entertaining rakugo synthesizer. For example, naturalness was not the most important factor, even though it has generally been emphasized as the most important point to evaluate in the conventional speech synthesis field. More important factors were the understandability of the content and the distinguishability of the characters in the rakugo story, in both of which the synthesized rakugo speech was relatively inferior to the professional performers. We also found that modeling of the fundamental frequency (f_o) should be further improved to better entertain audiences. These results show important steps toward authentically entertaining speech synthesis.