Paper Title

Detecting expressions with multimodal transformers

Paper Authors

Srinivas Parthasarathy, Shiva Sundaram

Paper Abstract

Developing machine learning algorithms to understand person-to-person engagement can result in natural user experiences for communal devices such as Amazon Alexa. Among other cues such as voice activity and gaze, a person's audio-visual expression, which includes tone of voice and facial expression, serves as an implicit signal of engagement between parties in a dialog. This study investigates deep-learning algorithms for audio-visual detection of users' expressions. We first implement an audio-visual baseline model with recurrent layers that shows competitive results compared to the current state of the art. Next, we propose a transformer architecture with encoder layers that better integrates audio-visual features for expression tracking. Performance on the Aff-Wild2 database shows that the proposed method outperforms the baseline architecture with recurrent layers, with absolute gains of approximately 2% for the arousal and valence descriptors. Further, multimodal architectures show significant improvements over models trained on single modalities, with gains of up to 3.6%. Ablation studies show the significance of the visual modality for expression detection on the Aff-Wild2 database.
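The abstract describes per-frame regression of arousal and valence from fused audio-visual features using transformer encoder layers. Below is a minimal sketch, assuming PyTorch, of that general model family; the feature dimensions, concatenation-based fusion, layer counts, and tanh output range are illustrative assumptions, not the authors' exact configuration.

# A minimal sketch (not the paper's implementation) of a multimodal transformer:
# per-frame audio and visual features are projected to a shared space,
# concatenated, passed through transformer encoder layers, and regressed to
# frame-level arousal/valence. All hyperparameters below are assumptions.
import torch
import torch.nn as nn


class AudioVisualTransformer(nn.Module):
    def __init__(self, audio_dim=40, visual_dim=512, d_model=128,
                 nhead=4, num_layers=2):
        super().__init__()
        # Per-modality linear projections into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Transformer encoder applied to the concatenated (fused) features.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=2 * d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Frame-level regression head for the two continuous descriptors.
        self.head = nn.Linear(2 * d_model, 2)  # [arousal, valence]

    def forward(self, audio, visual):
        # audio:  (batch, time, audio_dim)   e.g. acoustic frame features
        # visual: (batch, time, visual_dim)  e.g. face-embedding sequences
        fused = torch.cat(
            [self.audio_proj(audio), self.visual_proj(visual)], dim=-1)
        # tanh keeps predictions in [-1, 1], matching typical arousal/valence scales.
        return torch.tanh(self.head(self.encoder(fused)))


if __name__ == "__main__":
    model = AudioVisualTransformer()
    a = torch.randn(2, 100, 40)    # dummy audio feature sequence
    v = torch.randn(2, 100, 512)   # dummy visual feature sequence
    print(model(a, v).shape)       # torch.Size([2, 100, 2])

A unimodal variant for the ablation comparison would simply drop one projection branch and feed a single modality's features to the same encoder and regression head.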
