Paper title
Learning What and Where: Disentangling Location and Identity Tracking Without Supervision
Authors
Abstract
Our brain can almost effortlessly decompose visual data streams into background and salient objects. Moreover, it can anticipate object motion and interactions, which are crucial abilities for conceptual planning and reasoning. Recent object reasoning datasets, such as CATER, have revealed fundamental shortcomings of current vision-based AI systems, particularly when targeting explicit object representations, object permanence, and object reasoning. Here we introduce a self-supervised LOCation and Identity tracking system (Loci), which excels on the CATER tracking challenge. Inspired by the dorsal and ventral pathways in the brain, Loci tackles the binding problem by processing separate, slot-wise encodings of "what" and "where". Loci's predictive coding-like processing encourages active error minimization, such that individual slots tend to encode individual objects. Interactions between objects and object dynamics are processed in the disentangled latent space. Truncated backpropagation through time combined with forward eligibility accumulation significantly speeds up learning and improves memory efficiency. Besides exhibiting superior performance in current benchmarks, Loci effectively extracts objects from video streams and separates them into location and Gestalt components. We believe that this separation offers a representation that will facilitate effective planning and reasoning on conceptual levels.