Paper Title
TLB and Pagewalk Performance in Multicore Architectures with Large Die-Stacked DRAM Cache
Paper Authors
Abstract
In this work, we study the overheads of virtual-to-physical address translation in processor architectures, such as x86-64, that implement paged virtual memory using a radix tree walked in hardware. Translation Lookaside Buffers (TLBs) are critical to system performance, particularly as applications demand larger memory footprints and with the adoption of virtualization; however, a TLB miss can require multiple memory accesses to retrieve the translation. Architectural support for superpages has been introduced to increase TLB hit rates, but it is limited by the operating system's ability to find contiguous memory. Numerous prior studies have proposed TLB designs to lower miss rates and reduce page-walk overhead; however, these studies have modeled the behavior analytically. Further, to eschew paging overhead for big-memory workloads and virtualization, Direct Segment maps part of a process's linear virtual address space with segment registers, albeit requiring a few application and operating-system modifications. Recently evolved die-stacked DRAM technology promises a high-bandwidth, large last-level cache, on the order of gigabytes, closer to the processor. With such large caches, the amount of data that can be accessed without causing a TLB miss — the reach of the TLB — is inadequate. TLBs are on the critical path for data accesses, and incurring an expensive page walk can hinder system performance, especially when the data being accessed is a hit in the LLC. Hence, we are interested in exploring novel address-translation mechanisms commensurate with the size and latency of stacked DRAM. By accurately simulating the multitude of multi-level address-translation structures using the QEMU-based MARSSx86 full-system simulator, we perform a detailed study of TLBs in conjunction with large LLCs using multi-programmed and multi-threaded workloads.
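The gap between TLB reach and a gigabyte-scale DRAM cache can be illustrated with back-of-the-envelope arithmetic. The TLB sizes and DRAM-cache capacity below are illustrative assumptions (roughly representative of a modern x86-64 core), not figures taken from the paper:

```python
# Assumed parameters (not from the paper): a 1536-entry unified
# second-level TLB and a 4 GiB die-stacked DRAM last-level cache.
L2_TLB_ENTRIES = 1536
PAGE_4K = 4 * 1024               # baseline x86-64 page size
PAGE_2M = 2 * 1024 * 1024        # superpage size
DRAM_CACHE = 4 * 1024**3         # assumed die-stacked DRAM cache capacity

def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes addressable without incurring a TLB miss."""
    return entries * page_size

reach_4k = tlb_reach(L2_TLB_ENTRIES, PAGE_4K)   # reach with 4 KiB pages
reach_2m = tlb_reach(L2_TLB_ENTRIES, PAGE_2M)   # reach with 2 MiB superpages

print(f"4 KiB-page reach: {reach_4k / 2**20:.0f} MiB "
      f"(~{DRAM_CACHE // reach_4k}x smaller than the DRAM cache)")
print(f"2 MiB-page reach: {reach_2m / 2**30:.0f} GiB")
```

With 4 KiB pages the TLB covers only a few megabytes, orders of magnitude less than the cache, so many LLC hits still pay for an expensive page walk; even 2 MiB superpages only close the gap when the OS can actually allocate that contiguity, which is the mismatch motivating the study.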