Paper Title

PaStaNet: Toward Human Activity Knowledge Engine

Paper Authors

Yong-Lu Li, Liang Xu, Xinpeng Liu, Xijie Huang, Yue Xu, Shiyi Wang, Hao-Shu Fang, Ze Ma, Mingyang Chen, Cewu Lu

Paper Abstract


Existing image-based activity understanding methods mainly adopt direct mapping, i.e. from image to activity concepts, which may encounter a performance bottleneck due to the huge gap between the two. In light of this, we propose a new path: infer human part states first, then reason out the activities based on part-level semantics. Human Body Part States (PaSta) are fine-grained action semantic tokens, e.g. <hand, hold, something>, which can compose activities and help us step toward a human activity knowledge engine. To fully utilize the power of PaSta, we build a large-scale knowledge base, PaStaNet, which contains 7M+ PaSta annotations, and propose two corresponding models. First, we design a model named Activity2Vec to extract PaSta features, which aim to be general representations for various activities. Second, we use a PaSta-based Reasoning method to infer activities. Promoted by PaStaNet, our method achieves significant improvements, e.g. 6.4 and 13.9 mAP on the full and one-shot sets of HICO in supervised learning, and 3.2 and 4.2 mAP on V-COCO and image-based AVA in transfer learning. Code and data are available at http://hake-mvig.cn/.
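The abstract describes a two-stage pipeline: detect fine-grained body part states (PaSta) as <part, verb, something> triples, then reason out activities from combinations of those part-level tokens. The toy sketch below illustrates that compositional idea only; it is not the authors' Activity2Vec or reasoning model, and all names (`PASTA_VOCAB`, `ACTIVITY_RULES`, `reason_activities`) are hypothetical.

```python
# Illustrative sketch (not the paper's code): composing body part
# states (PaSta) into activity predictions via a toy rule base.

# Each PaSta token is a <part, verb, something> triple, as in the abstract.
PASTA_VOCAB = [
    ("hand", "hold", "something"),
    ("head", "look_at", "something"),
    ("hip", "sit_on", "something"),
]

# Hypothetical rules: an activity fires when all of its required
# part states are detected in the image.
ACTIVITY_RULES = {
    "read": {("hand", "hold", "something"), ("head", "look_at", "something")},
    "sit":  {("hip", "sit_on", "something")},
}

def reason_activities(detected_pasta):
    """Return activities whose required part states are all detected."""
    detected = set(detected_pasta)
    return [activity for activity, required in ACTIVITY_RULES.items()
            if required <= detected]  # subset test: all required states present

print(reason_activities([("hand", "hold", "something"),
                         ("head", "look_at", "something")]))
# -> ['read']
```

In the actual paper, this hard rule lookup is replaced by learned components: Activity2Vec extracts continuous PaSta features, and a learned reasoning module, rather than a fixed rule table, maps them to activity scores.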
