Versatile Multi-Modal Pre-Training for Human-Centric Perception

Paper Title

Versatile Multi-Modal Pre-Training for Human-Centric Perception

Authors

Fangzhou Hong, Liang Pan, Zhongang Cai, Ziwei Liu

Abstract

Human-centric perception plays a vital role in vision and graphics, but annotating human data is prohibitively expensive. It is therefore desirable to have a versatile pre-trained model that serves as a foundation for data-efficient transfer to downstream tasks. To this end, we propose HCMoCo, a Human-Centric Multi-Modal Contrastive Learning framework that leverages the multi-modal nature of human data (e.g., RGB, depth, 2D keypoints) for effective representation learning. This objective poses two main challenges: dense pre-training on multi-modal data, and efficient use of sparse human priors. To tackle them, we design two novel learning targets, Dense Intra-sample Contrastive Learning and Sparse Structure-aware Contrastive Learning, which hierarchically learn a modal-invariant latent space featuring a continuous and ordinal feature distribution and structure-aware semantic consistency. HCMoCo provides pre-training for different modalities by combining heterogeneous datasets, which allows efficient use of existing task-specific human data. Extensive experiments on four downstream tasks of different modalities demonstrate the effectiveness of HCMoCo, especially under data-efficient settings (7.16% and 12% improvement on DensePose Estimation and Human Parsing, respectively). Moreover, we demonstrate the versatility of HCMoCo by exploring cross-modality supervision and missing-modality inference, validating its strong ability in cross-modal association and reasoning.
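
The abstract does not spell out the loss functions, so the following is only a generic illustration of cross-modal contrastive learning, not HCMoCo's Dense Intra-sample or Sparse Structure-aware targets. It sketches a symmetric InfoNCE loss between sample-level embeddings of two modalities (say, RGB and depth), where the i-th rows of the two inputs are assumed to come from the same sample. The function names, the temperature value, and the NumPy formulation are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def _diagonal_cross_entropy(logits):
    """Cross-entropy where the positive for row i is column i."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def cross_modal_info_nce(feat_a, feat_b, temperature=0.07):
    """Symmetric InfoNCE between two (N, D) embedding matrices.

    Hypothetical helper: row i of feat_a and row i of feat_b are treated as a
    positive pair; every other row in the batch serves as a negative.
    """
    a = l2_normalize(np.asarray(feat_a, dtype=np.float64))
    b = l2_normalize(np.asarray(feat_b, dtype=np.float64))
    logits = a @ b.T / temperature  # (N, N) similarity matrix
    # Average the a->b and b->a directions so neither modality is privileged.
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits.T))
```

Minimizing such a loss pulls embeddings of the same sample from different modalities together and pushes other samples apart, which is one common way to obtain a modality-invariant latent space of the kind the abstract describes.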
