Paper Title
Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems
Paper Authors
Paper Abstract
Neural networks of ads systems usually take input from multiple resources, e.g., query-ad relevance, ad features and user portraits. These inputs are encoded into one-hot or multi-hot binary features, with typically only a tiny fraction of nonzero feature values per example. Deep learning models in online advertising industries can have terabyte-scale parameters that do not fit in the GPU memory nor the CPU main memory on a computing node. For example, a sponsored online advertising system can contain more than $10^{11}$ sparse features, making the neural network a massive model with around 10 TB parameters. In this paper, we introduce a distributed GPU hierarchical parameter server for massive scale deep learning ads systems. We propose a hierarchical workflow that utilizes GPU High-Bandwidth Memory, CPU main memory and SSD as 3-layer hierarchical storage. All the neural network training computations are contained in GPUs. Extensive experiments on real-world data confirm the effectiveness and the scalability of the proposed system. A 4-node hierarchical GPU parameter server can train a model more than 2X faster than a 150-node in-memory distributed parameter server in an MPI cluster. In addition, the price-performance ratio of our proposed system is 4-9 times better than an MPI-cluster solution.
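The 3-layer hierarchical storage described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: plain Python dicts stand in for GPU high-bandwidth memory, CPU main memory, and SSD, and the class name and methods (`HierarchicalParameterStore`, `pull_batch`, `push_batch`) are invented for this sketch. The idea shown is that only the sparse-feature embeddings touched by the current mini-batch are staged into the fastest tier, so training computations can stay on the GPU even though the full model lives on SSD.

```python
# Hypothetical sketch of a 3-tier parameter store mirroring the
# GPU-HBM / CPU-DRAM / SSD hierarchy. Dicts stand in for the real
# storage layers; names are illustrative, not from the paper.

class HierarchicalParameterStore:
    def __init__(self, ssd_params):
        self.ssd = dict(ssd_params)  # slowest tier: holds the full TB-scale model
        self.cpu = {}                # middle tier: CPU main-memory cache
        self.gpu = {}                # fastest tier: GPU working set for a batch

    def pull_batch(self, feature_ids):
        """Stage the embeddings needed by one mini-batch into the GPU tier."""
        for fid in feature_ids:
            if fid not in self.cpu:           # miss in CPU cache -> read from SSD
                self.cpu[fid] = self.ssd[fid]
            self.gpu[fid] = self.cpu[fid]     # promote to GPU working set
        return {fid: self.gpu[fid] for fid in feature_ids}

    def push_batch(self, updated):
        """Write embeddings updated by a training step back down the tiers."""
        for fid, value in updated.items():
            self.gpu[fid] = value
            self.cpu[fid] = value
            self.ssd[fid] = value


# Usage: only the few nonzero sparse features of a batch are staged.
store = HierarchicalParameterStore({i: [0.0, 0.0] for i in range(10)})
batch = store.pull_batch([1, 3, 5])
store.push_batch({1: [0.1, -0.2]})
```

In the real system each tier also handles eviction, batching of SSD I/O, and inter-node communication; this sketch only captures the pull-into-fast-memory / push-back-down data flow.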