Paper Title
Guided Dialog Policy Learning without Adversarial Learning in the Loop
Paper Authors
Paper Abstract
Reinforcement Learning (RL) methods have emerged as a popular choice for training an efficient and effective dialogue policy. However, these methods suffer from sparse and unstable reward signals returned by a user simulator only when a dialogue finishes. In addition, the reward signal is manually designed by human experts, which requires domain knowledge. Recently, a number of adversarial learning methods have been proposed to learn the reward function together with the dialogue policy. However, to alternately update the dialogue policy and the reward model on the fly, we are limited to policy-gradient-based algorithms, such as REINFORCE and PPO. Moreover, the alternating training of a dialogue agent and the reward model can easily get stuck in local optima or result in mode collapse. To overcome the listed issues, we propose to decompose the adversarial training into two steps. First, we train the discriminator with an auxiliary dialogue generator and then incorporate a derived reward model into a common RL method to guide the dialogue policy learning. This approach is applicable to both on-policy and off-policy RL methods. Based on our extensive experimentation, we can conclude that the proposed method: (1) achieves a remarkable task success rate using both on-policy and off-policy RL methods; and (2) has the potential to transfer knowledge from existing domains to a new domain.
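To make the two-step decomposition concrete, below is a minimal PyTorch sketch. It is not the authors' implementation: the names Discriminator, pretrain_discriminator, reward, real_pairs, aux_generator, STATE_DIM, and ACTION_DIM are illustrative, the state/action vectors are toy stand-ins for real dialogue-state and dialogue-act encodings, and the turn-level reward log D(s, a) is a common adversarial-imitation choice assumed here rather than the paper's exact reward formulation.

```python
# Minimal sketch of the two-step scheme, assuming toy (state, action) vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM = 32, 8  # hypothetical encoding sizes

class Discriminator(nn.Module):
    """Scores (state, action) pairs: high for human-like turns, low for generated ones."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))  # logits

def pretrain_discriminator(disc, real_pairs, aux_generator, steps=1000):
    """Step 1: train the discriminator against an auxiliary dialogue generator,
    independently of the dialogue policy that will be optimized later."""
    opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
    for _ in range(steps):
        real_s, real_a = real_pairs()      # (state, action) batch from the human corpus
        fake_s, fake_a = aux_generator()   # (state, action) batch from the auxiliary generator
        loss = (F.binary_cross_entropy_with_logits(
                    disc(real_s, real_a), torch.ones(real_s.size(0), 1)) +
                F.binary_cross_entropy_with_logits(
                    disc(fake_s, fake_a), torch.zeros(fake_s.size(0), 1)))
        opt.zero_grad(); loss.backward(); opt.step()

def reward(disc, state, action):
    """Step 2: freeze the discriminator and use it as a dense, turn-level reward
    signal for any standard RL algorithm."""
    with torch.no_grad():
        return torch.sigmoid(disc(state, action)).log()  # log D(s, a), assumed reward shape

# Toy usage with random data standing in for corpus and generator samples.
disc = Discriminator()
sample = lambda: (torch.randn(16, STATE_DIM), torch.randn(16, ACTION_DIM))
pretrain_discriminator(disc, sample, sample, steps=100)
print(reward(disc, *sample()).shape)  # torch.Size([16, 1]): one reward per turn
```

Because the discriminator is trained once against the auxiliary generator rather than against the evolving policy, the frozen reward function can be plugged into either an on-policy learner (e.g., PPO) or an off-policy one (e.g., DQN) without the alternating generator/discriminator updates of in-the-loop adversarial training.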