Paper Title

TIMIT-TTS: a Text-to-Speech Dataset for Multimodal Synthetic Media Detection

Authors

Salvi, Davide, Hosler, Brian, Bestagini, Paolo, Stamm, Matthew C., Tubaro, Stefano

Abstract

With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material are becoming increasingly straightforward. At the same time, sharing fake content on the web has become so simple that malicious users can create unpleasant situations with minimal effort. Forged media are also getting more and more complex, with manipulated videos taking the scene over still images. The multimedia forensic community has addressed the possible threats this situation implies by developing detectors that verify the authenticity of multimedia objects. However, the vast majority of these tools analyze only one modality at a time. This was not a problem as long as still images were considered the most widely edited media, but now that manipulated videos are becoming commonplace, performing monomodal analyses can be reductive. Nonetheless, the literature lacks work on multimodal detectors, mainly due to the scarcity of datasets containing forged multimodal data on which to train and test the designed algorithms. In this paper we focus on the generation of an audio-visual deepfake dataset. First, we present a general pipeline for synthesizing speech deepfake content from a given real or fake video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. Then, we use the pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. This can be used as a standalone audio dataset, or combined with other state-of-the-art sets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset under both monomodal and multimodal conditions, showing the need for multimodal forensic detectors and more suitable data.
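The abstract mentions Dynamic Time Warping as the technique used to time-align the synthetic speech track with the original video's speech. As a minimal sketch of the underlying idea (not the authors' implementation — the feature sequences and cost function here are illustrative assumptions), DTW finds the lowest-cost monotone alignment between two sequences of different lengths:

```python
def dtw_distance(a, b):
    """Minimal Dynamic Time Warping sketch: return the cumulative
    alignment cost between two 1-D sequences (e.g. per-frame audio
    features). Illustrative only; a real pipeline would align
    multi-dimensional speech features."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0.0: the repeated `2` in the second sequence is absorbed by stretching, which is exactly why DTW suits aligning speech of slightly different durations.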
