论文标题
在分布式计算中缓解散曲机的有效复制
Efficient Replication for Straggler Mitigation in Distributed Computing
论文作者
论文摘要
大师工作者分布式计算系统使用任务复制,以减轻慢速工人(称为散乱者)的效果。任务分为批处理,并分配给一个或多个工人进行执行。我们首先考虑批处理不重叠的情况,并且使用主要化理论的结果表明,对于一般的工人服务时间分布,批处理的平衡分配以使工人平衡地将平均职位计算时间最小化。接下来,我们表明,与文献中提出的重叠方案相比,这种非重叠批次的平衡分配达到了平均工作时间较低的时间。此外,我们得出了最佳冗余水平作为工人服务时间分布的函数。我们表明,将平均工作计算时间最小化的冗余水平不一定与最大化工作计算时间可预测性的冗余水平相同,因此在优化两个指标之间存在权衡。最后,通过在Google群集轨迹上运行实验,我们观察到冗余可以减少Google集群中工作时间的计算时间,并且最佳冗余水平取决于任务服务时间的分布。
Master-worker distributed computing systems use task replication in order to mitigate the effect of slow workers, known as stragglers. Tasks are grouped into batches and assigned to one or more workers for execution. We first consider the case when the batches do not overlap and, using the results from majorization theory, show that, for a general class of workers' service time distributions, a balanced assignment of batches to workers minimizes the average job compute time. We next show that this balanced assignment of non-overlapping batches achieves lower average job compute time compared to the overlapping schemes proposed in the literature. Furthermore, we derive the optimum redundancy level as a function of the service time distribution at workers. We show that the redundancy level that minimizes average job compute time is not necessarily the same as the redundancy level that maximizes the predictability of job compute time, and thus there exists a trade-off between optimizing the two metrics. Finally, by running experiments on Google cluster traces, we observe that redundancy can reduce the compute time of the jobs in Google clusters by an order of magnitude, and that the optimum level of redundancy depends on the distribution of tasks' service time.