Paper Title
Multi-encoder attention-based architectures for sound recognition with partial visual assistance
Paper Authors
Paper Abstract
Large-scale sound recognition data sets typically consist of acoustic recordings obtained from multimedia libraries. As a consequence, modalities other than audio can often be exploited to improve the outputs of models designed for associated tasks. Frequently, however, not all modalities are available for all samples of such a collection: for example, the original material may have been removed from the source platform at some point, and therefore, non-auditory features can no longer be acquired. We demonstrate that a multi-encoder framework can be employed to deal with this issue by applying it to attention-based deep learning systems, which are currently part of the state of the art in the domain of sound recognition. More specifically, we show that the proposed model extension can successfully be utilized to incorporate partially available visual information into the operational procedures of such networks, which normally only use auditory features during training and inference. Experimentally, we verify that the considered approach leads to improved predictions in a number of evaluation scenarios pertaining to audio tagging and sound event detection. Additionally, we scrutinize some properties and limitations of the presented technique.
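To make the idea concrete, here is a minimal sketch, assuming PyTorch, of a multi-encoder attention-based model in which the visual stream is optional: the audio encoder always contributes tokens, while visual tokens are appended only when the corresponding frames are available, and a learned query attends over whatever tokens exist. All module names and dimensions are hypothetical placeholders, not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class MultiEncoderSoundRecognizer(nn.Module):
    """Hypothetical sketch of a multi-encoder model with a partially
    available visual modality: the attention pooling simply operates
    over audio tokens alone when no visual features were acquired."""

    def __init__(self, audio_dim=64, visual_dim=512, d_model=128, num_classes=10):
        super().__init__()
        # Stand-ins for real per-modality encoders (e.g., CNN/Transformer stacks).
        self.audio_enc = nn.Linear(audio_dim, d_model)
        self.visual_enc = nn.Linear(visual_dim, d_model)
        # Learned clip-level query used to pool the multimodal token sequence.
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, audio, visual=None):
        # audio: (B, T_a, audio_dim); visual: (B, T_v, visual_dim) or None.
        tokens = self.audio_enc(audio)
        if visual is not None:
            # Append visual tokens only for samples where frames exist.
            tokens = torch.cat([tokens, self.visual_enc(visual)], dim=1)
        q = self.query.expand(audio.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # attend over available tokens
        return self.classifier(pooled.squeeze(1))


# Usage: the same network handles both audiovisual and audio-only clips.
model = MultiEncoderSoundRecognizer()
logits_av = model(torch.randn(2, 100, 64), torch.randn(2, 10, 512))
logits_a = model(torch.randn(2, 100, 64))  # visual material unavailable
```

Within a mixed batch, the same effect can be obtained by padding the visual token positions and passing a per-sample `key_padding_mask` to the attention module, so that clips with and without visual material can be processed together.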