Paper Title
One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL
Paper Authors
Paper Abstract
While reinforcement learning algorithms can learn effective policies for complex tasks, these policies are often brittle to even minor task variations, especially when variations are not explicitly provided during training. One natural approach to this problem is to train agents with manually specified variation in the training task or environment. However, this may be infeasible in practical situations, either because making perturbations is not possible, or because it is unclear how to choose suitable perturbation strategies without sacrificing performance. The key insight of this work is that learning diverse behaviors for accomplishing a task can directly lead to behavior that generalizes to varying environments, without needing to perform explicit perturbations during training. By identifying multiple solutions for the task in a single environment during training, our approach can generalize to new situations by abandoning solutions that are no longer effective and adopting those that are. We theoretically characterize a robustness set of environments that arises from our algorithm and empirically find that our diversity-driven approach can extrapolate to various changes in the environment and task.
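To make the mechanism concrete, the following is a minimal Python sketch (not the authors' implementation) of the test-time procedure the abstract describes: a latent-conditioned policy pi(a | s, z) has been trained so that each latent z indexes a distinct solution to the task, and in a perturbed test environment each latent is evaluated for a few episodes and the best-performing one is adopted. All identifiers here (`policy.act`, the gym-style `env` interface, `num_latents`) are illustrative assumptions.

```python
import numpy as np

def episode_return(env, policy, z):
    """Run one episode with the behavior indexed by latent z.

    Assumes a gym-style env (reset/step) and a hypothetical
    latent-conditioned policy interface policy.act(obs, z).
    """
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done, _ = env.step(policy.act(obs, z))
        total += reward
    return total

def select_best_latent(test_env, policy, num_latents, episodes_per_latent=3):
    """Few-shot extrapolation: abandon solutions that no longer work,
    adopt the highest-scoring one in the perturbed test environment."""
    scores = [
        np.mean([episode_return(test_env, policy, z)
                 for _ in range(episodes_per_latent)])
        for z in range(num_latents)
    ]
    return int(np.argmax(scores)), scores
```

On the training side, diversity among the latents is typically encouraged with a mutual-information bonus (e.g., a DIAYN-style discriminator reward) added to the task reward, so that each z converges to a different near-optimal strategy in the single training environment.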