Paper Title
RL agents Implicitly Learning Human Preferences
Paper Authors
Paper Abstract
In the real world, RL agents should be rewarded for fulfilling human preferences. We show that RL agents implicitly learn the preferences of humans in their environment. Training a classifier to predict whether a simulated human's preferences are fulfilled, based on the activations of an RL agent's neural network, achieves 0.93 AUC. Training a classifier on the raw environment state achieves only 0.8 AUC. Training the classifier on the RL agent's activations also does much better than training on activations from an autoencoder. The human preference classifier can be used as the reward function of an RL agent to make RL agents more beneficial to humans.
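As a rough illustrative sketch of the probing setup the abstract describes (not the authors' actual pipeline), the snippet below trains a linear classifier on stand-in agent activations and reports AUC. The activation array, hidden-layer width, and random labels are all placeholder assumptions; in the paper's setup, the activations would come from a trained RL agent's network and the labels from whether the simulated human's preference was fulfilled.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real setup: hidden-layer activations
# of a trained RL agent, paired with binary labels indicating whether the
# simulated human's preference was fulfilled in that state.
rng = np.random.default_rng(0)
activations = rng.normal(size=(10_000, 256))  # assumed 256-unit hidden layer
labels = rng.integers(0, 2, size=10_000)      # 1 = preference fulfilled

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# Train a simple linear probe on the activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Evaluate with AUC, the metric reported in the abstract.
scores = probe.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, scores))
```

In the same spirit, the probe's predicted probability of preference fulfillment could be emitted as a scalar reward signal for a downstream RL agent, which is how the abstract's last sentence proposes using the classifier.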