Paper Title

DMR API: Improving cluster productivity by turning applications into malleable

Authors

Sergio Iserte, Rafael Mayo, Enrique S. Quintana-Orti, Vicenc Beltran, Antonio J. Peña

Abstract

Adaptive workloads can change the configuration of their jobs on the fly, in terms of the number of processes. To carry out these job reconfigurations, we have designed a methodology that enables a job to communicate with the resource manager and, through the runtime, to change its number of MPI ranks. The collaboration between the workload manager (aware of the job queue and the resource allocation) and the parallel runtime (able to transparently handle the processes and the program data) is crucial for our throughput-aware malleability methodology. Hence, when a job triggers a reconfiguration, the resource manager checks the cluster status and returns an action: an expansion, if there are spare resources; a shrink, if queued jobs can be initiated; or none, if no change can improve the global productivity. In this paper, we describe the internals of our framework and how it reduces the global workload completion time while making smarter use of the underlying resources. For this purpose, we present a thorough study of adaptive workload processing, showing the detailed behavior of our framework in representative experiments and the low overhead that our reconfiguration involves.
