Paper Title
Fries: Fast and Consistent Runtime Reconfiguration in Dataflow Systems with Transactional Guarantees (Extended Version)
Paper Authors
Paper Abstract
A computing job in a big data system can take a long time to run, especially for pipelined executions on data streams. Developers often need to change the computing logic of the job, such as fixing a loophole in an operator or replacing the machine learning model in an operator with a cheaper model to handle a sudden increase in the data-ingestion rate. Recently, many systems have started supporting runtime reconfigurations to allow this type of change on the fly without killing and restarting the execution. While the delay in reconfiguration is critical to performance, existing systems use epochs to do runtime reconfigurations, which can cause a long delay. In this paper, we develop a new technique called Fries that leverages the emerging availability of fast control messages in many systems, since these messages can be sent without being blocked by data messages. We formally define consistency in runtime reconfigurations and develop a Fries scheduler with consistency guarantees. The technique not only works for different classes of dataflows, but also works for parallel executions and supports fault tolerance. Our extensive experimental evaluation on clusters shows the advantages of this technique compared to epoch-based schedulers.