Paper Title
GROOT: Corrective Reward Optimization for Generative Sequential Labeling
Paper Authors
Paper Abstract
Sequential labeling is a fundamental NLP task, forming the backbone of many applications. Supervised learning of Seq2Seq models has shown great success on these problems. However, the training objectives are still significantly disconnected from the metrics and desiderata we care about in practice. For example, a practical sequence tagging application may want to optimize for a certain precision-recall trade-off (of the top-k predictions), which is quite different from the standard objective of maximizing the likelihood of the gold labeled sequence. Thus, to bridge this gap, we propose GROOT -- a simple yet effective framework for Generative Reward Optimization Of Text sequences. GROOT works by training a generative sequential labeling model to match the decoder output distribution with that of the (black-box) reward function. Using an iterative training regime, we first generate prediction candidates, then correct errors in them, and finally contrast those candidates (based on their reward values). As demonstrated via extensive experiments on four public benchmarks, GROOT significantly improves all reward metrics. Furthermore, GROOT leads to improvements of the overall decoder distribution as evidenced by the quality gains of the top-$k$ candidates.
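To make the generate → correct → contrast loop described in the abstract concrete, the following is a minimal, self-contained sketch of one such iteration. It is not the authors' implementation: the functions sample_candidates, correct_candidate, and sequence_reward are toy stand-ins for the actual Seq2Seq decoder, the error-correction step, and the black-box reward, respectively, and the ranked (candidate, reward) pairs are only printed rather than fed into a contrastive/reward-matching decoder update.

```python
# Illustrative sketch (not the paper's code) of a GROOT-style iteration:
# 1) sample candidate label sequences, 2) correct errors in them,
# 3) contrast/rank candidates by their (black-box) reward values.
# All helper names below are hypothetical placeholders.

import random
from typing import List, Tuple

LabelSeq = List[str]


def sequence_reward(pred: LabelSeq, gold: LabelSeq) -> float:
    """Stand-in black-box reward: fraction of matching positions.
    A real application would plug in its own metric, e.g. a precision-recall
    trade-off over the top-k predictions."""
    matches = sum(p == g for p, g in zip(pred, gold))
    return matches / max(len(pred), len(gold), 1)


def sample_candidates(gold: LabelSeq, k: int, noise: float = 0.3) -> List[LabelSeq]:
    """Toy stand-in for decoding k candidates from a Seq2Seq model:
    perturb the gold sequence with random label noise."""
    labels = ["O", "B-ENT", "I-ENT"]
    return [
        [g if random.random() > noise else random.choice(labels) for g in gold]
        for _ in range(k)
    ]


def correct_candidate(cand: LabelSeq, gold: LabelSeq) -> LabelSeq:
    """Toy 'correction' step: repair one erroneous position,
    yielding a higher-reward variant of the same candidate."""
    fixed = list(cand)
    for i, (c, g) in enumerate(zip(cand, gold)):
        if c != g:
            fixed[i] = g
            break
    return fixed


def groot_style_iteration(gold: LabelSeq, k: int = 4) -> List[Tuple[LabelSeq, float]]:
    """One iteration: generate candidates, add corrected variants,
    then rank everything by reward. In training, the ranked pairs would
    drive a contrastive loss that aligns the decoder distribution with
    the reward function."""
    candidates = sample_candidates(gold, k)
    candidates += [correct_candidate(c, gold) for c in candidates]
    scored = [(c, sequence_reward(c, gold)) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # high-reward first
    return scored


if __name__ == "__main__":
    gold = ["O", "B-ENT", "I-ENT", "O", "O"]
    for cand, r in groot_style_iteration(gold):
        print(f"reward={r:.2f}  labels={cand}")
```

In this toy setting the "correction" simply consults the gold sequence; the point is only to show how corrected, higher-reward variants enlarge the candidate pool before the reward-based contrast step.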