Learning Invariant World State Representations with Predictive Coding

Paper title

Learning Invariant World State Representations with Predictive Coding

Authors

Avi Ziskind, Sujeong Kim, Giedrius T. Burachas

Abstract

Self-supervised learning methods overcome the key bottleneck for building more capable AI: the limited availability of labeled data. However, one drawback of self-supervised architectures is that the representations they learn are implicit, and it is hard to extract meaningful information about the encoded world states, such as the 3D structure of the visual scene encoded in a depth map. Moreover, in the visual domain such representations only rarely undergo evaluations that may be critical for downstream tasks, such as vision for autonomous cars. Herein, we propose a framework for evaluating visual representations for illumination invariance in the context of depth perception. We develop a new predictive coding-based architecture and a hybrid fully-supervised/self-supervised learning method. We propose a novel architecture that extends the predictive coding approach: PRedictive Lateral bottom-Up and top-Down Encoder-decoder Network (PreludeNet), which explicitly learns to infer and predict depth from video frames. In PreludeNet, the encoder's stack of predictive coding layers is trained in a self-supervised manner, while the predictive decoder is trained in a supervised manner to infer or predict the depth. We evaluate the robustness of our model on a new synthetic dataset, in which lighting conditions (such as overall illumination and the effect of shadows) can be parametrically adjusted while keeping all other aspects of the world constant. PreludeNet achieves both competitive depth inference performance and next frame prediction accuracy. We also show how this new network architecture, coupled with the hybrid fully-supervised/self-supervised learning method, achieves a balance between this performance and invariance to changes in lighting. The proposed framework for evaluating visual representations can be extended to diverse task domains and invariance tests.
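
The abstract describes a hybrid objective (a self-supervised next-frame term for the encoder and a supervised depth term for the decoder) and an invariance test that varies lighting while holding the scene fixed. The sketch below only illustrates that idea; it is not the authors' implementation. The function names, the mean-absolute-error metric, the `supervised_weight` parameter, and the weighted-sum form of the loss are all assumptions made for illustration, and images are flattened to plain lists of floats to keep the example self-contained.

```python
from typing import Sequence


def mean_abs_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def hybrid_loss(
    pred_frame: Sequence[float],
    next_frame: Sequence[float],
    pred_depth: Sequence[float],
    true_depth: Sequence[float],
    supervised_weight: float = 0.5,
) -> float:
    """Weighted sum of a self-supervised and a supervised term.

    The weighted-sum form and `supervised_weight` are hypothetical;
    the abstract does not state how the two terms are combined.
    """
    # Self-supervised term: needs no labels, only the next video frame.
    frame_term = mean_abs_error(pred_frame, next_frame)
    # Supervised term: needs a ground-truth depth map.
    depth_term = mean_abs_error(pred_depth, true_depth)
    return (1.0 - supervised_weight) * frame_term + supervised_weight * depth_term


def illumination_gap(
    depth_under_light_a: Sequence[float],
    depth_under_light_b: Sequence[float],
) -> float:
    """Disagreement between depth estimates for one scene under two lightings.

    A value near zero suggests the estimate is invariant to the lighting
    change (an assumed, simplified invariance score).
    """
    return mean_abs_error(depth_under_light_a, depth_under_light_b)
```

Under these assumptions, setting `supervised_weight` to 0 reduces the objective to pure next-frame prediction, while setting it to 1 reduces it to pure depth supervision; intermediate values correspond to the trade-off between performance and lighting invariance that the abstract mentions.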
