Analysis of Workflow Schedulers in Simulated Distributed Environments

Paper title

Analysis of Workflow Schedulers in Simulated Distributed Environments

Authors

Beránek, Jakub; Böhm, Stanislav; Cima, Vojtěch

Abstract

Task graphs provide a simple way to describe scientific workflows (sets of tasks with dependencies) that can be executed both on HPC clusters and in the cloud. An important aspect of executing such graphs is the scheduling algorithm used. Many scheduling heuristics have been proposed in existing works; nevertheless, they are often tested in oversimplified environments. We provide an extensible simulation environment designed for prototyping and benchmarking task schedulers, which contains implementations of various scheduling algorithms and is open-sourced so that results are fully reproducible. We use this environment to perform a comprehensive analysis of workflow scheduling algorithms with a focus on quantifying the effect of scheduling challenges that have so far been mostly neglected, such as delays between scheduler invocations or partially unknown task durations. Our results indicate that network models used by many previous works might produce results that are off by an order of magnitude in comparison to a more realistic model. Additionally, we show that certain implementation details of scheduling algorithms that are often neglected can have a large effect on the scheduler's performance, and they should thus be described in great detail to enable proper evaluation.
