Paper Title
Reinforcement Learning to Rank Using Coarse-grained Rewards
Paper Authors
Paper Abstract
Learning to rank (LTR) plays a crucial role in various Information Retrieval (IR) tasks. Although supervised LTR methods based on fine-grained relevance labels (e.g., document-level annotations) have achieved significant success, their reliance on costly and potentially biased annotations limits scalability and alignment with real-world objectives. In contrast, coarse-grained feedback signals, such as dwell time and session-level engagement, are more accessible and affordable. Reinforcement Learning (RL) offers a promising framework to directly optimize these objectives using reward signals, but most existing Reinforcement Learning to Rank (RLTR) approaches suffer from high variance and low sample efficiency. Motivated by recent advances in large language models (LLMs), we re-examine the problem of RLTR with coarse-grained rewards and propose new RLTR methods based on RL algorithms widely used for LLMs. We systematically compare supervised learning and RL-based methods across various model architectures and coarse-grained reward functions on large-scale LTR benchmarks. Experimental results demonstrate that advanced RL methods can directly learn from coarse-grained rewards and outperform strong supervised learning baselines, even those trained with fine-grained labels. This shows the great potential of RLTR for metric-agnostic ranking optimization.
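The abstract leaves the algorithmic details to the paper itself. Purely as an illustration of the setting it describes, the sketch below assumes a Plackett-Luce ranking policy trained with a GRPO-style group-normalized policy gradient (one of the RL algorithms widely used for LLMs) and a binary top-k engagement proxy as the coarse-grained reward. The function names (`sample_ranking`, `coarse_reward`, `grpo_style_update`) and the reward definition are assumptions for illustration, not the authors' method.

```python
# Minimal sketch, NOT the paper's actual method: a Plackett-Luce ranking policy
# trained with a GRPO-style group-normalized policy gradient, using a binary
# top-k "engagement" proxy as the coarse-grained, list-level reward.
import torch


def sample_ranking(scores):
    """Sample a permutation from the Plackett-Luce policy defined by `scores`
    (Gumbel trick) and return the ranking together with its log-probability."""
    gumbel = -torch.log(-torch.log(torch.rand_like(scores)))
    ranking = torch.argsort(scores + gumbel, descending=True)
    ordered = scores[ranking]
    logp = torch.tensor(0.0)
    for i in range(len(ranking)):
        logp = logp + ordered[i] - torch.logsumexp(ordered[i:], dim=0)
    return ranking, logp


def coarse_reward(ranking, relevance, k=10):
    """Coarse-grained reward for a whole ranked list: 1 if any relevant document
    appears in the top-k (a crude stand-in for session-level engagement), else 0."""
    return float(relevance[ranking[:k]].sum() > 0)


def grpo_style_update(scorer, optimizer, features, relevance, group_size=8):
    """Sample a group of rankings for one query, normalize their rewards within
    the group (GRPO-style baseline), and take one policy-gradient step."""
    scores = scorer(features).squeeze(-1)  # one score per candidate document
    logps, rewards = [], []
    for _ in range(group_size):
        ranking, logp = sample_ranking(scores)
        logps.append(logp)
        rewards.append(coarse_reward(ranking, relevance))
    rewards = torch.tensor(rewards)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    loss = -(advantages * torch.stack(logps)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage: a linear scorer over random features for one query with 20 candidates.
scorer = torch.nn.Linear(16, 1)
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-3)
features, relevance = torch.randn(20, 16), (torch.rand(20) > 0.8).float()
for _ in range(5):
    grpo_style_update(scorer, optimizer, features, relevance)
```

In this sketch the within-group reward normalization acts as a per-query baseline, which is the usual mechanism by which such group-based methods reduce the variance and improve the sample efficiency that the abstract identifies as weaknesses of earlier RLTR approaches.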