Paper Title
Efficient Data-Plane Memory Scheduling for In-Network Aggregation
Authors
Abstract
As the scale of distributed training grows, communication becomes a bottleneck. To accelerate communication, recent works introduce In-Network Aggregation (INA), which moves gradient summation into network middleboxes, such as programmable switches, to reduce traffic volume. However, switch memory is scarce compared to the volume of gradients transmitted in distributed training. Although prior work applies methods such as pool-based streaming or dynamic sharing to tackle this mismatch, switch memory remains a potential performance bottleneck. Furthermore, we observe that switch memory is under-utilized in recent works because aggregator deallocation requires synchronization. To improve switch memory utilization, we propose ESA, an $\underline{E}$fficient Switch Memory $\underline{S}$cheduler for In-Network $\underline{A}$ggregation. At its core, ESA enforces a preemptive aggregator allocation primitive and introduces priority scheduling at the data plane, which improves switch memory utilization and average job completion time (JCT). Experiments show that ESA can improve the average JCT by up to $1.35\times$.
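The abstract does not spell out how preemptive allocation works, so the following is only a toy host-side model of the general idea, not ESA's actual data-plane logic. Every name in it (`Aggregator`, `SwitchMemory`, `on_packet`, the event labels) is hypothetical. It also assumes that a larger number means higher priority, that colliding tasks map to the same slot by a simple modulo, and that an evicted partial sum or an unaggregated packet is forwarded to an end host for completion.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (event, job_id, seq, value, count) emitted by the modelled switch
Event = Tuple[str, int, int, float, int]


@dataclass
class Aggregator:
    """One switch memory slot holding a partial gradient sum (assumed layout)."""
    job_id: int
    seq: int        # index of the gradient chunk being summed
    priority: int   # assumption: larger value means more urgent
    value: float    # running sum of the gradients seen so far
    count: int      # number of workers folded into the sum


class SwitchMemory:
    """Toy model of a fixed pool of aggregators with priority preemption."""

    def __init__(self, num_slots: int, workers_per_job: int) -> None:
        self.slots: List[Optional[Aggregator]] = [None] * num_slots
        self.workers_per_job = workers_per_job

    def on_packet(self, job_id: int, seq: int, priority: int,
                  grad: float) -> List[Event]:
        # Assumed slot mapping: a simple deterministic modulo index.
        idx = (job_id * 31 + seq) % len(self.slots)
        slot = self.slots[idx]

        if slot is None:
            # Free slot: allocate it to this aggregation task.
            self.slots[idx] = Aggregator(job_id, seq, priority, grad, 1)
            return []

        if slot.job_id == job_id and slot.seq == seq:
            # Same task: fold the gradient into the running sum.
            slot.value += grad
            slot.count += 1
            if slot.count == self.workers_per_job:
                # All workers contributed: emit the result and free the slot.
                self.slots[idx] = None
                return [("complete", job_id, seq, slot.value, slot.count)]
            return []

        if priority > slot.priority:
            # Collision with a lower-priority task: preempt it. The partial
            # sum is emitted so it can be finished elsewhere (assumption).
            evicted = ("evicted", slot.job_id, slot.seq, slot.value, slot.count)
            self.slots[idx] = Aggregator(job_id, seq, priority, grad, 1)
            return [evicted]

        # Collision with an equal- or higher-priority task: the packet passes
        # through unaggregated.
        return [("passthrough", job_id, seq, grad, 1)]
```

In this model a slot held by a low-priority task that is still waiting for a slow worker no longer blocks a more urgent task, which is one way to read the under-utilization problem the abstract attributes to synchronized deallocation.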