Paper title
Mars: Near-Optimal Throughput with Shallow Buffers in Reconfigurable Datacenter Networks
Paper authors
Paper abstract
The performance of large-scale computing systems often critically depends on high-performance communication networks. Dynamically reconfigurable topologies, e.g., based on optical circuit switches, are emerging as an innovative new technology to deal with the explosive growth of datacenter traffic. Specifically, \emph{periodic} reconfigurable datacenter networks (RDCNs) such as RotorNet (SIGCOMM 2017), Opera (NSDI 2020) and Sirius (SIGCOMM 2020) have been shown to provide high throughput, by emulating a \emph{complete graph} through fast periodic circuit switch scheduling. However, to achieve such a high throughput, existing reconfigurable network designs pay a high price: in terms of potentially high delays, but also, as we show as a first contribution in this paper, in terms of the high buffer requirements. In particular, we show that under buffer constraints, emulating the high-throughput complete graph is infeasible at scale, and we uncover a spectrum of unvisited and attractive alternative RDCNs, which emulate regular graphs, but with lower node degree than the complete graph. We present Mars, a periodic reconfigurable topology which emulates a $d$-regular graph with near-optimal throughput. In particular, we systematically analyze how the degree~$d$ can be optimized for throughput given the available buffer and delay tolerance of the datacenter. We further show empirically that Mars achieves higher throughput compared to existing systems when buffer sizes are bounded.
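To make "emulating a graph through periodic circuit switch scheduling" concrete, here is a minimal sketch. It assumes a simplified model with one uplink per node and a rotation-based schedule; the names `rotation_schedule` and `emulated_graph` are hypothetical, and the construction is only an illustration, not the one used by Mars or by the systems cited in the abstract. With `d = n - 1` timeslots the union of circuits over one period is the complete graph; with a smaller `d` it is a $d$-regular directed graph reached in a shorter period.

```python
from typing import Dict, List, Set


def rotation_schedule(n: int, d: int) -> List[Dict[int, int]]:
    """Build d timeslots; in slot k, node i holds a circuit to (i + k) mod n.

    Assumption: one uplink per node, so each slot is a single matching.
    """
    if not 1 <= d <= n - 1:
        raise ValueError("d must satisfy 1 <= d <= n - 1")
    return [{i: (i + k) % n for i in range(n)} for k in range(1, d + 1)]


def emulated_graph(schedule: List[Dict[int, int]]) -> Dict[int, Set[int]]:
    """Union of the circuits of all slots: the directed graph seen over one period."""
    graph: Dict[int, Set[int]] = {}
    for slot in schedule:
        for src, dst in slot.items():
            graph.setdefault(src, set()).add(dst)
    return graph


n = 8
complete = emulated_graph(rotation_schedule(n, n - 1))  # period of n - 1 slots
sparse = emulated_graph(rotation_schedule(n, 3))        # period of 3 slots

# Every node reaches all other nodes directly within one period.
print(all(len(neighbors) == n - 1 for neighbors in complete.values()))
# Every node has exactly 3 direct neighbors within the shorter period.
print(all(len(neighbors) == 3 for neighbors in sparse.values()))
```

In this toy model the degree $d$ fixes the period length: a node waits at most $d$ slots for any of its direct circuits to recur, but with $d < n - 1$ some destinations are only reachable over multiple hops.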