Predicting batch queue job wait times for informed scheduling of urgent HPC workloads

Paper title

Predicting batch queue job wait times for informed scheduling of urgent HPC workloads

Authors

Nick Brown, Gordon Gibb, Evgenij Belikov, Rupert Nash

Abstract

There is increasing interest in the use of HPC machines for urgent workloads to help tackle disasters as they unfold. Whilst batch queue systems are not ideal for supporting such workloads, many of their disadvantages can be worked around by accurately predicting when a waiting job will start to run. However, there are numerous challenges in achieving such a prediction with high accuracy, not least because the queue's state can change rapidly and depends upon many factors. In this work we explore a novel machine learning approach for predicting queue wait times, hypothesising that such a model can capture the complex behaviour resulting from the queue policy and other interactions to generate accurate job start times. For ARCHER2 (HPE Cray EX), Cirrus (HPE 8600) and 4-cabinet (HPE Cray EX) we explore how different machine learning approaches and techniques improve the accuracy of our predictions, comparing against the estimate generated by Slurm. We demonstrate that our techniques deliver the most accurate predictions across our machines of interest, with the result of this work being the ability to predict job start times within one minute of the actual start time for around 65% of jobs on ARCHER2 and 4-cabinet, and 76% of jobs on Cirrus. When compared against what Slurm can deliver, this represents around 3.8 times better accuracy on ARCHER2 and 18 times better on Cirrus. Furthermore, our approach can accurately predict the start time for three quarters of all jobs within ten minutes of the actual start time on ARCHER2 and 4-cabinet, and for 90% of jobs on Cirrus. Whilst the driver of this work has been to better facilitate placement of urgent workloads across HPC machines, the insights gained can be used to provide wider benefits to users, enrich existing batch queue systems and inform policy.
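
The accuracy figures in the abstract are stated as the share of jobs whose predicted start time falls within a tolerance (one minute or ten minutes) of the actual start time. As a minimal sketch of how such a figure could be computed, assuming start times are available as seconds since a common reference point: the function name `fraction_within` and the sample values below are hypothetical and are not taken from the paper.

```python
def fraction_within(predicted, actual, tolerance_s):
    """Return the share of jobs whose predicted start time lies within
    tolerance_s seconds of the actual start time."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    # Count a job as a hit when the absolute error is within the tolerance.
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance_s)
    return hits / len(predicted)


# Hypothetical start times in seconds (illustrative only, not real queue data).
predicted = [100, 250, 900, 4000]
actual = [130, 200, 1300, 4030]

print(fraction_within(predicted, actual, 60))   # one-minute tolerance
print(fraction_within(predicted, actual, 600))  # ten-minute tolerance
```

The same function could be applied to the start-time estimates reported by Slurm to obtain a baseline for comparison, which is the style of comparison the abstract describes.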
