Paper Title
Reinforcement Learning for Branch-and-Bound Optimisation using Retrospective Trajectories
Paper Authors
Paper Abstract
Combinatorial optimisation problems framed as mixed integer linear programmes (MILPs) are ubiquitous across a range of real-world applications. The canonical branch-and-bound algorithm seeks to exactly solve MILPs by constructing a search tree of increasingly constrained sub-problems. In practice, its solving time is dependent on heuristics, such as the choice of the next variable to constrain ('branching'). Recently, machine learning (ML) has emerged as a promising paradigm for branching. However, prior works have struggled to apply reinforcement learning (RL), citing sparse rewards, difficult exploration, and partial observability as significant challenges. Instead, leading ML methodologies resort to approximating high-quality handcrafted heuristics with imitation learning (IL), which precludes the discovery of novel policies and requires expensive data labelling. In this work, we propose retro branching, a simple yet effective approach to RL for branching. By retrospectively deconstructing the search tree into multiple paths, each contained within a sub-tree, we enable the agent to learn from shorter trajectories with more predictable next states. In experiments on four combinatorial tasks, our approach enables learning-to-branch without any expert guidance or pre-training. We outperform the current state-of-the-art RL branching algorithm by 3-5x and come within 20% of the best IL method's performance on MILPs with 500 constraints and 1000 variables, with ablations verifying that our retrospectively constructed trajectories are essential to achieving these results.
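
As a rough illustration of the retrospective trajectory idea, the following is a minimal sketch under our own assumptions, not the paper's exact construction: after an episode, the recorded branch-and-bound search tree is split into disjoint root-to-leaf paths, and every sibling node that is not followed becomes the root of a new sub-tree to decompose next; each extracted path can then serve as a short training trajectory. The dict-based tree representation and the first-child selection rule are illustrative assumptions.

```python
# Illustrative sketch only: decompose a recorded branch-and-bound search tree
# into root-to-leaf paths, one per sub-tree. The tree format (dict: node id ->
# list of child ids) and the path-selection rule (always follow the first
# child) are assumptions for illustration, not the paper's algorithm.
from collections import deque
from typing import Dict, Hashable, List

Tree = Dict[Hashable, List[Hashable]]

def retrospective_paths(tree: Tree, root: Hashable) -> List[List[Hashable]]:
    """Split a search tree into disjoint root-to-leaf paths.

    Each extracted path lives inside its own sub-tree: we walk from the
    sub-tree root down to a leaf, and every sibling we do not follow becomes
    the root of a new sub-tree queued for later decomposition.
    """
    paths: List[List[Hashable]] = []
    subtree_roots = deque([root])
    while subtree_roots:
        node = subtree_roots.popleft()
        path = [node]
        # Follow one child at every level until a leaf is reached.
        while tree.get(node):
            children = tree[node]
            chosen, rest = children[0], children[1:]  # assumed selection rule
            subtree_roots.extend(rest)                # siblings start new sub-trees
            path.append(chosen)
            node = chosen
        paths.append(path)
    return paths

# Example: a small complete tree with 7 nodes.
example_tree = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [], 4: [], 5: [], 6: []}
print(retrospective_paths(example_tree, 0))
# -> [[0, 1, 3], [2, 5], [4], [6]]
```

In this toy example the single 7-node episode yields four shorter trajectories, which is the property the abstract highlights: the agent learns from shorter paths with more predictable next states rather than from one long chronological episode.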