Paper Title
Disentangled Action Recognition with Knowledge Bases
Paper Authors
Paper Abstract
Actions in videos usually involve interactions between humans and objects. Action labels are typically composed of various combinations of verbs and nouns, but we may not have training data for all possible combinations. In this paper, we aim to improve the generalization ability of compositional action recognition models to novel verbs or novel nouns unseen during training by leveraging the power of knowledge graphs. Previous work uses verb-noun compositional action nodes in the knowledge graph, which scales poorly because the number of compositional action nodes grows quadratically with the number of verbs and nouns. To address this issue, we propose Disentangled Action Recognition with Knowledge bases (DARK), which leverages the inherent compositionality of actions. DARK trains a factorized model that first extracts disentangled feature representations for verbs and nouns, and then predicts classification weights using relations in external knowledge graphs. A type constraint between verbs and nouns is extracted from external knowledge bases and applied when composing actions. DARK scales better with the number of objects and verbs, and achieves state-of-the-art performance on the Charades dataset. We further propose a new benchmark split based on the EPIC-Kitchens dataset, which is an order of magnitude larger in the number of classes and samples, and benchmark various models on it.
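The factorized scoring described in the abstract can be illustrated with a minimal sketch. The PyTorch module below is a hypothetical illustration, not the paper's implementation: it assumes a pooled clip feature, assumes the knowledge-graph relations are summarized as per-class node embeddings (`verb_kg_emb`, `noun_kg_emb`), and assumes the type constraint arrives as a binary verb-by-noun mask. All names and dimensions are made up for illustration.

```python
# A minimal, hypothetical sketch of a factorized verb/noun scoring head.
# Module names, dimensions, and the way knowledge-graph information enters
# the model are assumptions for illustration only.
import torch
import torch.nn as nn


class FactorizedActionHead(nn.Module):
    def __init__(self, feat_dim: int, kg_dim: int):
        super().__init__()
        # Disentangled projections of a shared video feature.
        self.verb_proj = nn.Linear(feat_dim, feat_dim)
        self.noun_proj = nn.Linear(feat_dim, feat_dim)
        # Predict classification weights from external knowledge-graph node
        # embeddings (a stand-in for propagating relations over the graph).
        self.verb_weight_gen = nn.Linear(kg_dim, feat_dim)
        self.noun_weight_gen = nn.Linear(kg_dim, feat_dim)

    def forward(self, video_feat, verb_kg_emb, noun_kg_emb, type_mask):
        # video_feat:  [B, feat_dim] pooled clip feature from any backbone
        # verb_kg_emb: [V, kg_dim] embeddings of verb classes from the KG
        # noun_kg_emb: [N, kg_dim] embeddings of noun classes from the KG
        # type_mask:   [V, N] boolean matrix of feasible verb-noun pairs
        verb_logits = self.verb_proj(video_feat) @ self.verb_weight_gen(verb_kg_emb).t()  # [B, V]
        noun_logits = self.noun_proj(video_feat) @ self.noun_weight_gen(noun_kg_emb).t()  # [B, N]
        # Compose action scores additively and rule out infeasible pairs
        # using the type constraint extracted from the knowledge base.
        action = verb_logits.unsqueeze(2) + noun_logits.unsqueeze(1)  # [B, V, N]
        return action.masked_fill(~type_mask.unsqueeze(0), float("-inf"))


# Shape-only usage with random tensors.
head = FactorizedActionHead(feat_dim=512, kg_dim=300)
scores = head(
    torch.randn(2, 512),                   # batch of 2 clip features
    torch.randn(10, 300),                  # 10 verb classes
    torch.randn(20, 300),                  # 20 noun classes
    torch.ones(10, 20, dtype=torch.bool),  # all pairs allowed here
)
print(scores.shape)  # torch.Size([2, 10, 20])
```

Composing verb and noun logits additively and masking infeasible pairs keeps the number of learned classifier parameters linear in the number of verbs plus nouns, which is the scalability advantage the abstract claims over knowledge graphs with explicit verb-noun action nodes.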