Paper Title
RL agents Implicitly Learning Human Preferences
Paper Authors
Paper Abstract
In the real world, RL agents should be rewarded for fulfilling human preferences. We show that RL agents implicitly learn the preferences of humans in their environment. Training a classifier to predict whether a simulated human's preferences are fulfilled, based on the activations of an RL agent's neural network, achieves 0.93 AUC. Training a classifier on the raw environment state achieves only 0.8 AUC. Training the classifier on the RL agent's activations also does much better than training on activations from an autoencoder. The human preference classifier can be used as the reward function of an RL agent to make RL agents more beneficial to humans.
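As a rough illustrative sketch of the probing setup the abstract describes (not the authors' actual pipeline), the snippet below trains a linear classifier on stand-in agent activations and reports AUC. The activation array, hidden-layer width, and random labels are all placeholder assumptions; in the paper's setup, the activations would come from a trained RL agent's network and the labels from whether the simulated human's preference was fulfilled.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real setup: hidden-layer activations
# of a trained RL agent, paired with binary labels indicating whether the
# simulated human's preference was fulfilled in that state.
rng = np.random.default_rng(0)
activations = rng.normal(size=(10_000, 256))  # assumed 256-unit hidden layer
labels = rng.integers(0, 2, size=10_000)      # 1 = preference fulfilled

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# Train a simple linear probe on the activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Evaluate with AUC, the metric reported in the abstract.
scores = probe.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, scores))
```

In the same spirit, the probe's predicted probability of preference fulfillment could be emitted as a scalar reward signal for a downstream RL agent, which is how the abstract's last sentence proposes using the classifier.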