Paper Title


RefCrowd: Grounding the Target in Crowd with Referring Expressions

Paper Authors

Heqian Qiu, Hongliang Li, Taijin Zhao, Lanxiao Wang, Qingbo Wu, Fanman Meng

Paper Abstract

Crowd understanding has aroused widespread interest in the vision domain due to its important practical significance. Unfortunately, little effort has been made to explore crowd understanding in the multi-modal domain that bridges natural language and computer vision. Referring expression comprehension (REF) is such a representative multi-modal task. Current REF studies focus more on grounding the target object among multiple distinctive categories in general scenarios, which is difficult to apply to complex real-world crowd understanding. To fill this gap, we propose a new challenging dataset, called RefCrowd, which aims at grounding the target person in a crowd with referring expressions. It requires not only sufficiently mining the natural language information, but also carefully attending to subtle differences between the target and a crowd of persons with similar appearance, so as to realize the fine-grained mapping from language to vision. Furthermore, we propose a Fine-grained Multi-modal Attribute Contrastive Network (FMAC) to deal with REF in crowd understanding. It first decomposes the intricate visual and language features into attribute-aware multi-modal features, and then captures discriminative yet robust fine-grained attribute features to effectively distinguish the subtle differences between similar persons. The proposed method outperforms existing state-of-the-art (SoTA) methods on our RefCrowd dataset and on existing REF datasets. In addition, we implement an end-to-end REF toolbox to facilitate deeper research in the multi-modal domain. Our dataset and code are available at: \url{https://qiuheqian.github.io/datasets/refcrowd/}.
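
To make the attribute-aware matching idea in the abstract more concrete, below is a minimal PyTorch sketch: visual features of candidate persons and a sentence feature are each projected into several attribute-aware sub-spaces, per-attribute cosine similarities are computed, and an expression-conditioned gate weights the attributes into a final matching score. The module name `AttributeAwareMatcher`, the feature dimensions, and the number of attributes are all assumptions for illustration, not the authors' FMAC implementation.

```python
# Hypothetical sketch of attribute-aware multi-modal matching, in the spirit of
# the FMAC description above. Names, dimensions, and the attribute count are
# assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeAwareMatcher(nn.Module):
    def __init__(self, vis_dim=1024, lang_dim=768, attr_dim=256, num_attrs=4):
        super().__init__()
        # One projection head per assumed attribute (e.g. appearance, pose,
        # location, interaction) for each modality.
        self.vis_heads = nn.ModuleList(
            nn.Linear(vis_dim, attr_dim) for _ in range(num_attrs))
        self.lang_heads = nn.ModuleList(
            nn.Linear(lang_dim, attr_dim) for _ in range(num_attrs))
        # Attribute weights predicted from the expression, so attributes the
        # sentence does not mention contribute less to the final score.
        self.attr_gate = nn.Linear(lang_dim, num_attrs)

    def forward(self, vis_feats, lang_feat):
        # vis_feats: (num_persons, vis_dim) pooled features of person candidates
        # lang_feat: (lang_dim,) sentence-level feature of the expression
        sims = []
        for v_head, l_head in zip(self.vis_heads, self.lang_heads):
            v = F.normalize(v_head(vis_feats), dim=-1)   # (N, attr_dim)
            l = F.normalize(l_head(lang_feat), dim=-1)   # (attr_dim,)
            sims.append(v @ l)                           # (N,) cosine similarity
        sims = torch.stack(sims, dim=-1)                 # (N, num_attrs)
        gate = torch.softmax(self.attr_gate(lang_feat), dim=-1)
        return sims @ gate                               # (N,) matching scores

# Usage: score 12 candidate persons against one referring expression and
# pick the best-matching one.
matcher = AttributeAwareMatcher()
scores = matcher(torch.randn(12, 1024), torch.randn(768))
target_idx = scores.argmax().item()
```

In the paper's setting, a contrastive objective over these per-attribute features would then push the referred person apart from visually similar distractors; the sketch only covers the scoring step.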
