Paper Title
End-To-End Audiovisual Feature Fusion for Active Speaker Detection
Paper Authors
Paper Abstract
Active speaker detection plays a vital role in human-machine interaction. Recently, a few end-to-end audiovisual frameworks have emerged. However, the inference time of these models has not been explored, and they are not suitable for real-time applications due to their complexity and large input size. In addition, they all adopt a similar feature extraction strategy that applies ConvNets to both the audio and visual inputs. This work presents a novel two-stream end-to-end framework that fuses features extracted from images via VGG-M with Mel-Frequency Cepstral Coefficient (MFCC) features extracted from the raw audio waveform. The network attaches two BiGRU layers to each stream to model each stream's temporal dynamics before fusion. After fusion, one BiGRU layer is attached to model the joint temporal dynamics. Experimental results on the AVA-ActiveSpeaker dataset indicate that our new feature extraction strategy is more robust to noisy signals and achieves better inference time than models that employ ConvNets on both modalities. The proposed model produces a prediction within 44.41 ms, which is fast enough for real-time applications. Our best-performing model attained 88.929% accuracy, nearly the same detection result as state-of-the-art work.
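To make the described architecture concrete, the sketch below shows one plausible PyTorch realization of the two-stream design from the abstract: per-frame VGG-M visual features and MFCC audio features each pass through two BiGRU layers, the streams are concatenated, and one BiGRU models the joint temporal dynamics. The feature dimensions, hidden size, and the final linear classifier are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TwoStreamASD(nn.Module):
    """Minimal sketch of the two-stream audiovisual fusion model from the abstract.
    Feature sizes and layer hyperparameters are assumptions for illustration only."""

    def __init__(self, visual_feat_dim=512, mfcc_dim=13, hidden=128):
        super().__init__()
        # Visual stream: per-frame VGG-M embeddings -> 2 BiGRU layers
        self.visual_gru = nn.GRU(visual_feat_dim, hidden, num_layers=2,
                                 batch_first=True, bidirectional=True)
        # Audio stream: MFCC frames from the raw waveform -> 2 BiGRU layers
        self.audio_gru = nn.GRU(mfcc_dim, hidden, num_layers=2,
                                batch_first=True, bidirectional=True)
        # Joint stream: one BiGRU over the concatenated (fused) features
        self.joint_gru = nn.GRU(4 * hidden, hidden, num_layers=1,
                                batch_first=True, bidirectional=True)
        # Hypothetical per-frame classification head: speaking / not speaking
        self.classifier = nn.Linear(2 * hidden, 2)

    def forward(self, visual_feats, mfcc_feats):
        # visual_feats: (B, T, visual_feat_dim) VGG-M features per face crop
        # mfcc_feats:   (B, T, mfcc_dim) MFCCs aligned to the video frames
        v, _ = self.visual_gru(visual_feats)   # (B, T, 2*hidden)
        a, _ = self.audio_gru(mfcc_feats)      # (B, T, 2*hidden)
        fused = torch.cat([v, a], dim=-1)      # feature-level fusion
        j, _ = self.joint_gru(fused)           # joint temporal modeling
        return self.classifier(j)              # per-frame logits

# Example: a batch of 4 clips, 25 frames each
model = TwoStreamASD()
logits = model(torch.randn(4, 25, 512), torch.randn(4, 25, 13))
print(logits.shape)  # torch.Size([4, 25, 2])
```

Using BiGRU layers on each stream before fusion keeps the two modalities' temporal modeling independent until the fusion point, which is the property the abstract credits for robustness to noisy signals.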