Paper Title
Visual onoma-to-wave: environmental sound synthesis from visual onomatopoeias and sound-source images
Paper Authors
Paper Abstract
We propose a method for synthesizing environmental sounds from visually represented onomatopoeias and sound-source images. An onomatopoeia is a word that imitates the structure of a sound, i.e., a text representation of sound. From this perspective, Onoma-to-Wave has been proposed to synthesize environmental sounds from desired onomatopoeia texts. Onomatopoeias have another representation: the visual-text representation of sounds used in comics, advertisements, and virtual reality. A visual onomatopoeia (the visual text of an onomatopoeia) contains rich information absent from plain text, such as duration cues conveyed by the length of the image, so this representation is expected to enable the synthesis of diverse sounds. We therefore propose visual onoma-to-wave for environmental sound synthesis from visual onomatopoeias. The method can transfer visual concepts of the visual text and the sound-source image to the synthesized sound. We also propose a data augmentation method focusing on the repetition of onomatopoeias to enhance the performance of our method. Experimental evaluations show that these methods can synthesize diverse environmental sounds from visual texts and sound-source images.
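The abstract does not specify how the repetition-focused augmentation works, but the idea lends itself to a small illustration. The sketch below is a minimal, hypothetical version assuming each training pair is a visual-onomatopoeia image aligned with a waveform: tiling the image horizontally while repeating the waveform yields a new, longer training pair. The function name and file names are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from PIL import Image


def augment_by_repetition(onoma_img: Image.Image,
                          waveform: np.ndarray,
                          n_repeats: int) -> tuple[Image.Image, np.ndarray]:
    """Hypothetical repetition-based augmentation: tile the visual
    onomatopoeia horizontally and repeat the waveform the same number
    of times, so image and audio remain temporally aligned."""
    w, h = onoma_img.size
    tiled = Image.new(onoma_img.mode, (w * n_repeats, h))
    for i in range(n_repeats):
        tiled.paste(onoma_img, (i * w, 0))   # place the i-th copy side by side
    repeated_wave = np.tile(waveform, n_repeats)  # concatenate audio copies
    return tiled, repeated_wave


# Illustrative usage with hypothetical file names:
# img = Image.open("don_don.png")   # visual onomatopoeia image
# wave = np.load("don_don.npy")     # aligned waveform samples
# aug_img, aug_wave = augment_by_repetition(img, wave, n_repeats=3)
```

Under this reading, the augmentation exploits the fact that a repeated onomatopoeia image naturally corresponds to a repeated sound, expanding the training data without new recordings.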