Paper Title

Booster: An Accelerator for Gradient Boosting Decision Trees

Paper Authors

He, Mingxuan, Vijaykumar, T. N., Thottethodi, Mithuna

Paper Abstract

We propose Booster, a novel accelerator for gradient boosting trees based on the unique characteristics of gradient boosting models. We observe that the dominant steps of gradient boosting training (accounting for 90-98% of training time) involve simple, fine-grained, independent operations on small-footprint data structures (e.g., accumulate and compare values in the structures). Unfortunately, existing multicores and GPUs are unable to harness this parallelism because they do not support massively-parallel data structure accesses that are irregular and data-dependent. By employing a scalable sea-of-small-SRAMs approach and an SRAM bandwidth-preserving mapping of data record fields to the SRAMs, Booster achieves significantly more parallelism (e.g., 3200-way parallelism) than multicores and GPUs. In addition, Booster employs a redundant data representation that significantly lowers the memory bandwidth demand. Our simulations reveal that Booster achieves 11.4x speedup and 6.4x speedup over an ideal 32-core multicore and an ideal GPU, respectively. Based on ASIC synthesis of FPGA-validated RTL using 45 nm technology, we estimate a Booster chip to occupy 60 mm^2 of area and dissipate 23 W when operating at 1-GHz clock speed.
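To make the "accumulate and compare" step concrete, the following is a minimal sketch of histogram-based split finding, the kind of fine-grained, independent per-record work the abstract attributes to gradient boosting training. This is an illustrative reconstruction, not the paper's implementation; the function name, the simplified gain formula, and all variable names are assumptions for the example.

```python
def best_split(binned_feature, gradients, n_bins):
    """Find the best split point for one feature from binned records.

    binned_feature: per-record bin index for this feature.
    gradients: per-record gradient values from the boosting step.
    """
    # Accumulate phase: scatter each record's gradient into its bin.
    # Each add touches a small-footprint structure (the histogram) and
    # is independent across records -- the parallelism the paper targets.
    hist = [0.0] * n_bins
    count = [0] * n_bins
    for b, g in zip(binned_feature, gradients):
        hist[b] += g
        count[b] += 1

    total_g, total_n = sum(hist), len(gradients)

    # Compare phase: scan candidate split points, scoring each one
    # from running left/right sums and keeping the best.
    best_bin, best_gain = None, float("-inf")
    left_g, left_n = 0.0, 0
    for b in range(n_bins - 1):
        left_g += hist[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        # Simplified variance-reduction-style gain score (illustrative).
        gain = left_g**2 / left_n + right_g**2 / right_n
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```

In a hardware realization along the lines the abstract suggests, each small SRAM would hold one such histogram, letting many accumulate/compare streams run in parallel rather than serializing through a shared cache hierarchy.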
