Paper Title

Recurrent networks, hidden states and beliefs in partially observable environments

Authors

Gaspard Lambrechts, Adrien Bolland, Damien Ernst

Abstract

Reinforcement learning aims to learn optimal policies from interaction with environments whose dynamics are unknown. Many methods rely on the approximation of a value function to derive near-optimal policies. In partially observable environments, these functions depend on the complete sequence of observations and past actions, called the history. In this work, we show empirically that recurrent neural networks trained to approximate such value functions internally filter the posterior probability distribution of the current state given the history, called the belief. More precisely, we show that, as a recurrent neural network learns the Q-function, its hidden states become more and more correlated with the beliefs of state variables that are relevant to optimal control. This correlation is measured through their mutual information. In addition, we show that the expected return of an agent increases with the ability of its recurrent architecture to reach a high mutual information between its hidden states and the beliefs. Finally, we show that the mutual information between the hidden states and the beliefs of variables that are irrelevant for optimal control decreases through the learning process. In summary, this work shows that in its hidden states, a recurrent neural network approximating the Q-function of a partially observable environment reproduces a sufficient statistic from the history that is correlated to the relevant part of the belief for taking optimal actions.
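To make the setup concrete, below is a minimal sketch (not the authors' code) of a recurrent Q-network that processes a history of observations and past actions and exposes its hidden states, which could then be compared offline with the beliefs produced by an exact filter, for example through a mutual information estimator. The use of PyTorch, a GRU cell, and all layer sizes and names are illustrative assumptions.

```python
# Minimal sketch of a recurrent Q-network over histories (o_1, a_1, ..., o_t),
# returning both Q-values and hidden states. The hidden states can later be
# compared with beliefs b(s_t | h_t) via a mutual-information estimator.
import torch
import torch.nn as nn


class RecurrentQNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        # The GRU consumes the history step by step: each input is the current
        # observation concatenated with the one-hot previous action.
        self.gru = nn.GRU(obs_dim + n_actions, hidden_dim, batch_first=True)
        # Q-values for every action, computed from the current hidden state.
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, observations: torch.Tensor, prev_actions: torch.Tensor):
        # observations: (batch, time, obs_dim)
        # prev_actions (one-hot): (batch, time, n_actions)
        history = torch.cat([observations, prev_actions], dim=-1)
        hidden_states, _ = self.gru(history)      # (batch, time, hidden_dim)
        q_values = self.q_head(hidden_states)     # (batch, time, n_actions)
        # Hidden states are returned so their mutual information with the
        # beliefs of the state variables can be estimated offline.
        return q_values, hidden_states


# Usage sketch with random data: 8 histories of length 20.
if __name__ == "__main__":
    net = RecurrentQNetwork(obs_dim=10, n_actions=4)
    obs = torch.randn(8, 20, 10)
    acts = torch.zeros(8, 20, 4)
    q, h = net(obs, acts)
    print(q.shape, h.shape)  # torch.Size([8, 20, 4]) torch.Size([8, 20, 64])
```

In this sketch, training the Q-head with a temporal-difference loss is what would drive the hidden states toward encoding the control-relevant part of the belief, as argued in the abstract.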
