Paper Title

Booster: An Accelerator for Gradient Boosting Decision Trees

Paper Authors

He, Mingxuan, Vijaykumar, T. N., Thottethodi, Mithuna

Paper Abstract

We propose Booster, a novel accelerator for gradient boosting trees based on the unique characteristics of gradient boosting models. We observe that the dominant steps of gradient boosting training (accounting for 90-98% of training time) involve simple, fine-grained, independent operations on small-footprint data structures (e.g., accumulate and compare values in the structures). Unfortunately, existing multicores and GPUs are unable to harness this parallelism because they do not support massively-parallel data structure accesses that are irregular and data-dependent. By employing a scalable sea-of-small-SRAMs approach and an SRAM bandwidth-preserving mapping of data record fields to the SRAMs, Booster achieves significantly more parallelism (e.g., 3200-way parallelism) than multicores and GPUs. In addition, Booster employs a redundant data representation that significantly lowers the memory bandwidth demand. Our simulations reveal that Booster achieves 11.4x speedup and 6.4x speedup over an ideal 32-core multicore and an ideal GPU, respectively. Based on ASIC synthesis of FPGA-validated RTL using 45 nm technology, we estimate a Booster chip to occupy 60 mm^2 of area and dissipate 23 W when operating at 1-GHz clock speed.
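To make the "accumulate and compare" step concrete, the following is a minimal sketch of histogram-based split finding, the kind of fine-grained, independent per-record work the abstract attributes to gradient boosting training. This is an illustrative reconstruction, not the paper's implementation; the function name, the simplified gain formula, and all variable names are assumptions for the example.

```python
def best_split(binned_feature, gradients, n_bins):
    """Find the best split point for one feature from binned records.

    binned_feature: per-record bin index for this feature.
    gradients: per-record gradient values from the boosting step.
    """
    # Accumulate phase: scatter each record's gradient into its bin.
    # Each add touches a small-footprint structure (the histogram) and
    # is independent across records -- the parallelism the paper targets.
    hist = [0.0] * n_bins
    count = [0] * n_bins
    for b, g in zip(binned_feature, gradients):
        hist[b] += g
        count[b] += 1

    total_g, total_n = sum(hist), len(gradients)

    # Compare phase: scan candidate split points, scoring each one
    # from running left/right sums and keeping the best.
    best_bin, best_gain = None, float("-inf")
    left_g, left_n = 0.0, 0
    for b in range(n_bins - 1):
        left_g += hist[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        # Simplified variance-reduction-style gain score (illustrative).
        gain = left_g**2 / left_n + right_g**2 / right_n
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```

In a hardware realization along the lines the abstract suggests, each small SRAM would hold one such histogram, letting many accumulate/compare streams run in parallel rather than serializing through a shared cache hierarchy.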
