Paper Title
Teacher Forcing Recovers Reward Functions for Text Generation
Paper Authors
Paper Abstract
Reinforcement learning (RL) has been widely used in text generation to alleviate the exposure bias issue or to utilize non-parallel datasets. The reward function plays an important role in making RL training successful. However, previous reward functions are typically task-specific and sparse, restricting the use of RL. In our work, we propose a task-agnostic approach that derives a step-wise reward function directly from a model trained with teacher forcing. We additionally propose a simple modification to stabilize the RL training on non-parallel datasets with our induced reward function. Empirical results show that our method outperforms self-training and reward regression methods on several text generation tasks, confirming the effectiveness of our reward function.
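The abstract's key technical claim is that a step-wise reward can be read off a model trained with ordinary teacher forcing. Below is a minimal PyTorch sketch of one plausible instantiation, in which the reward at each step is the teacher-forcing model's log-probability of the sampled next token. The helper name `stepwise_reward` and this exact reward form are illustrative assumptions, not the paper's actual derivation, and a Hugging Face-style causal LM returning `.logits` is assumed.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def stepwise_reward(teacher_model, input_ids):
    """Score each token of a sampled sequence with a teacher-forcing model.

    Assumes `teacher_model(input_ids).logits` returns a tensor of shape
    (batch, seq_len, vocab), as in Hugging Face causal LMs.

    input_ids: LongTensor of shape (batch, seq_len), a sampled sequence.
    Returns a (batch, seq_len - 1) tensor of per-step rewards.
    """
    logits = teacher_model(input_ids).logits            # (batch, seq_len, vocab)
    # Positions 0..T-2 predict tokens 1..T-1 under teacher forcing.
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    # Reward at step t = log-prob the teacher model assigns to token t+1.
    targets = input_ids[:, 1:].unsqueeze(-1)            # next-token targets
    return log_probs.gather(-1, targets).squeeze(-1)    # (batch, seq_len - 1)
```

Under these assumptions, a policy-gradient trainer could sum or discount these per-step rewards over each sampled continuation, giving a dense training signal in place of the single sequence-level score that task-specific sparse rewards provide.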