Paper Title

Detecting expressions with multimodal transformers

Paper Authors

Srinivas Parthasarathy, Shiva Sundaram

Paper Abstract

Developing machine learning algorithms to understand person-to-person engagement can result in natural user experiences for communal devices such as Amazon Alexa. Among other cues such as voice activity and gaze, a person's audio-visual expression, which includes tone of voice and facial expression, serves as an implicit signal of engagement between parties in a dialog. This study investigates deep-learning algorithms for audio-visual detection of users' expressions. We first implement an audio-visual baseline model with recurrent layers that shows competitive results compared to the current state of the art. Next, we propose a transformer architecture with encoder layers that better integrates audio-visual features for expression tracking. Performance on the Aff-Wild2 database shows that the proposed method outperforms the baseline architecture with recurrent layers, with absolute gains of approximately 2% for the arousal and valence descriptors. Further, multimodal architectures show significant improvements over models trained on single modalities, with gains of up to 3.6%. Ablation studies show the significance of the visual modality for expression detection on the Aff-Wild2 database.
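The abstract describes per-frame regression of arousal and valence from fused audio-visual features using transformer encoder layers. Below is a minimal sketch, assuming PyTorch, of that general model family; the feature dimensions, concatenation-based fusion, layer counts, and tanh output range are illustrative assumptions, not the authors' exact configuration.

# A minimal sketch (not the paper's implementation) of a multimodal transformer:
# per-frame audio and visual features are projected to a shared space,
# concatenated, passed through transformer encoder layers, and regressed to
# frame-level arousal/valence. All hyperparameters below are assumptions.
import torch
import torch.nn as nn


class AudioVisualTransformer(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=512, d_model=128,
                 nhead=4, num_layers=2):
        super().__init__()
        # Per-modality linear projections into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Transformer encoder applied to the concatenated (fused) features.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=2 * d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Frame-level regression head for the two continuous descriptors.
        self.head = nn.Linear(2 * d_model, 2)  # [arousal, valence]

    def forward(self, audio, visual):
        # audio:  (batch, time, audio_dim)   e.g. acoustic frame features
        # visual: (batch, time, visual_dim)  e.g. face-embedding sequences
        fused = torch.cat(
            [self.audio_proj(audio), self.visual_proj(visual)], dim=-1)
        # tanh keeps predictions in [-1, 1], matching typical arousal/valence scales.
        return torch.tanh(self.head(self.encoder(fused)))


if __name__ == "__main__":
    model = AudioVisualTransformer()
    a = torch.randn(2, 100, 40)    # dummy audio feature sequence
    v = torch.randn(2, 100, 512)   # dummy visual feature sequence
    print(model(a, v).shape)       # torch.Size([2, 100, 2])

A unimodal variant for the ablation comparison would simply drop one projection branch and feed a single modality's features to the same encoder and regression head.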
