Paper Title
Nano: Nested Human-in-the-Loop Reward Learning for Few-shot Language Model Control
Paper Authors
Paper Abstract
Pretrained language models have demonstrated extraordinary capabilities in language generation. However, real-world tasks often require controlling the distribution of generated text in order to mitigate bias, promote fairness, and achieve personalization. Existing techniques for controlling this distribution only work with quantified distributions, which require pre-defined categories, proportions of the distribution, or an existing corpus that follows the desired distribution. However, many important distributions, such as personal preferences, are unquantified. In this work, we tackle the problem of generating text that follows arbitrary distributions (quantified and unquantified) by proposing Nano, a few-shot human-in-the-loop training algorithm that continuously learns from human feedback. Compared to previous work, Nano achieves state-of-the-art results on single-topic/attribute control as well as quantified distribution control. We also show that Nano is able to learn unquantified distributions, achieves personalization, and captures the differences between individuals' personal preferences with high sample efficiency.
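The abstract describes Nano only at a high level. As a rough illustration of the general human-in-the-loop reward-learning pattern it refers to, the sketch below shows one feedback round: the language model samples continuations, a human scores them, and the model is fine-tuned with reward-weighted likelihood. This is a minimal sketch, not the authors' implementation; the model name (gpt2), the prompt, the 0-1 scoring scale, and the reward-weighting scheme are all illustrative assumptions.

```python
# Minimal human-in-the-loop sketch (illustrative, not the Nano algorithm):
# sample continuations, collect human scores, fine-tune with reward-weighted loss.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM could stand in here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "The movie was"  # illustrative prompt


def sample_texts(n=4, max_new_tokens=30):
    """Draw n continuations from the current model."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs, do_sample=True, top_p=0.9,
        max_new_tokens=max_new_tokens, num_return_sequences=n,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]


def collect_feedback(texts):
    """Ask a human to score each sample in [0, 1] -- the feedback step."""
    rewards = []
    for text in texts:
        print(text)
        rewards.append(float(input("reward (0-1): ")))
    return rewards


def update(texts, rewards):
    """Reward-weighted maximum likelihood on the human-scored samples."""
    model.train()
    for text, reward in zip(texts, rewards):
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        (reward * loss).backward()  # weight each sample's gradient by its reward
    optimizer.step()
    optimizer.zero_grad()


for _ in range(3):  # a few feedback rounds
    samples = sample_texts()
    update(samples, collect_feedback(samples))
```

A fuller system would typically also fit an explicit reward model on the collected feedback, so that new samples can be scored automatically between rounds of human input; the sketch only captures the outer sample-score-update loop.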