Paper Title
CARGO: Context Augmented Critical Region Offload for Network-bound Datacenter Workloads
Paper Authors
Paper Abstract
Network-bound applications, like a database server executing OLTP queries or a caching server storing objects for dynamic web applications, are essential services that consumers and businesses use daily. These services run in large datacenters and are required to meet predefined Service Level Objectives (SLOs), or latency targets, with high probability. Thus, efficient datacenter applications should optimize their execution in terms of both power and performance. However, to support large-scale data storage, these workloads make heavy use of pointer-connected data structures (e.g., hash tables, large fan-out trees, tries) and exhibit poor instruction- and memory-level parallelism. Our experiments show that, due to long memory access latencies, these workloads occupy processor resources (e.g., ROB entries, RS buffers, LS queue entries) for prolonged periods of time, delaying the processing of subsequent requests. Delayed execution not only increases request processing latency, but also severely affects an application's throughput and power efficiency. To overcome this limitation, we present CARGO, a novel mechanism that overlaps queuing latency with request processing by executing select instructions on an application's critical path at the network interface card (NIC) while requests wait for processor resources to become available. Our mechanism dynamically identifies the critical instructions and includes the register state needed to compute the long-latency memory accesses. This context-augmented critical region is often executed at the NIC well before execution begins at the core, effectively prefetching the data ahead of time. Across a variety of interactive datacenter applications, our proposal improves latency, throughput, and power efficiency by 2.7X, 2.7X, and 1.5X, respectively, while incurring a modest storage overhead.
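To make the pointer-chasing problem concrete, the sketch below (our illustration, not code from the paper) shows a chained hash-table lookup of the kind these workloads run per request. The address computation marked as the critical region depends only on a small context (the request key and the table base), which is the kind of short instruction sequence plus register state CARGO would ship to the NIC to execute while the request queues; here `__builtin_prefetch` merely stands in for that NIC-side early execution.

```c
#include <stddef.h>

#define NBUCKETS 64

/* Pointer-connected structure: each ->next hop is a likely cache miss
 * that stalls the core and pins ROB/LSQ entries for the whole latency. */
struct node {
    unsigned long key;
    long value;
    struct node *next;
};

static struct node *table[NBUCKETS];

static long lookup(unsigned long key)
{
    /* Critical region: computing the first long-latency address needs
     * only the key and the table base, so the register context that
     * must accompany the offloaded instructions is small. */
    struct node *n = table[key % NBUCKETS];

    while (n) {
        /* Stand-in for NIC-side early execution: warm the next hop of
         * the pointer chain before the core dereferences it. */
        __builtin_prefetch(n->next);
        if (n->key == key)
            return n->value;
        n = n->next;
    }
    return -1; /* hypothetical miss sentinel for this sketch */
}
```

Executing just this region ahead of time overlaps the chain's memory latency with the request's queuing delay, which is the effect the abstract describes as prefetching the data before execution begins at the core.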