Paper Title

Assessment of Reward Functions for Reinforcement Learning Traffic Signal Control under Real-World Limitations

Paper Authors

Alvaro Cabrejas-Egea, Shaun Howell, Maksis Knutins, Colm Connaughton

Paper Abstract

Adaptive traffic signal control is one key avenue for mitigating the growing consequences of traffic congestion. Incumbent solutions such as SCOOT and SCATS require regular and time-consuming calibration, cannot optimise well for multiple road use modalities, and require the manual curation of many implementation plans. A recent alternative to these approaches is deep reinforcement learning, in which an agent learns how to take the most appropriate action for a given state of the system. This is guided by neural networks approximating a reward function that provides feedback to the agent regarding the performance of the actions taken, making it sensitive to the specific reward function chosen. Several authors have surveyed the reward functions used in the literature, but attributing outcome differences to reward function choice across works is problematic, as there are many uncontrolled differences as well as different outcome metrics. This paper compares the performance of agents using different reward functions in a simulation of a junction in Greater Manchester, UK, across various demand profiles, subject to real-world constraints: realistic sensor inputs, controllers, calibrated demand, intergreen times and stage sequencing. The reward metrics considered are based on the time spent stopped, lost time, change in lost time, average speed, queue length, junction throughput and variations of these magnitudes. The performance of these reward functions is compared in terms of total waiting time. We find that speed maximisation resulted in the lowest average waiting times across all demand levels, displaying significantly better performance than other rewards previously introduced in the literature.
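
As a rough illustration of the reward families the abstract lists, here is a minimal Python sketch. It is not the paper's exact formulation: the function names, normalisations, and the assumed simulator outputs (per-vehicle speeds and stop flags, per-lane queue counts, a junction throughput counter) are all assumptions made for clarity.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VehicleState:
    speed: float    # current speed in m/s
    stopped: bool   # True if the vehicle is below a stop threshold

def reward_time_stopped(vehicles: List[VehicleState]) -> float:
    """Penalise the number of vehicles currently stopped (a per-step
    proxy for accumulated time spent stopped)."""
    return -sum(1.0 for v in vehicles if v.stopped)

def reward_avg_speed(vehicles: List[VehicleState],
                     speed_limit: float = 13.9) -> float:
    """Reward average speed normalised by an assumed speed limit
    (13.9 m/s ~ 50 km/h). This is the speed-maximisation family the
    paper reports as performing best."""
    if not vehicles:
        return 0.0
    return sum(v.speed for v in vehicles) / (len(vehicles) * speed_limit)

def reward_queue_length(queues: List[int]) -> float:
    """Penalise total queue length summed over all approach lanes."""
    return -float(sum(queues))

def reward_throughput(vehicles_passed: int) -> float:
    """Reward the number of vehicles that cleared the junction since
    the previous control action."""
    return float(vehicles_passed)

def reward_delta(previous: float, current: float) -> float:
    """'Change in' variants (e.g. change in lost time) reward the
    improvement of a metric between consecutive control steps."""
    return previous - current
```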
