Paper Title
Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads
Paper Authors
Paper Abstract
Sparse matrices are the key ingredients of several application domains, from scientific computation to machine learning. The primary challenge with sparse matrices has been efficiently storing and transferring data, for which many sparse formats have been proposed to significantly eliminate zero entries. Such formats, essentially designed to optimize memory footprint, may not be as successful in performing faster processing. In other words, although they allow faster data transfer and improve memory bandwidth utilization -- the classic challenge of sparse problems -- their decompression mechanism can potentially create a computation bottleneck. Not only is this challenge not resolved, but it also becomes more serious with the advent of domain-specific architectures (DSAs), as they intend to more aggressively improve performance. The performance implications of using various formats along with DSAs, however, have not been extensively studied by prior work. To fill this knowledge gap, we characterize the impact of using seven frequently used sparse formats on performance, based on a DSA for sparse matrix-vector multiplication (SpMV), implemented on an FPGA using high-level synthesis (HLS) tools, a growing and popular method for developing DSAs. Seeking a fair comparison, we tailor and optimize the HLS implementation of decompression for each format. We thoroughly explore diverse metrics, including decompression overhead, latency, balance ratio, throughput, memory bandwidth utilization, resource utilization, and power consumption, on a variety of real-world and synthetic sparse workloads.