Paper Title
Directed Acyclic Graph Factorization Machines for CTR Prediction via Knowledge Distillation
Paper Authors
Paper Abstract
With the growth of high-dimensional sparse data in web-scale recommender systems, the computational cost of learning high-order feature interactions in the CTR prediction task increases substantially, which limits the use of high-order interaction models in real industrial applications. Some recent knowledge-distillation-based methods transfer knowledge from complex teacher models to shallow student models to accelerate online model inference. However, they suffer from model-accuracy degradation during the knowledge distillation process, and it is challenging to balance the efficiency and effectiveness of shallow student models. To address this problem, we propose a Directed Acyclic Graph Factorization Machine (KD-DAGFM) that learns high-order feature interactions from existing complex interaction models for CTR prediction via Knowledge Distillation. The proposed lightweight student model, DAGFM, can learn arbitrary explicit feature interactions from teacher networks and achieves approximately lossless performance, which is proved by a dynamic programming algorithm. Besides, an improved general model, KD-DAGFM+, is shown to be effective in distilling both explicit and implicit feature interactions from any complex teacher model. Extensive experiments are conducted on four real-world datasets, including a large-scale industrial dataset from the WeChat platform with billions of feature dimensions. KD-DAGFM achieves the best performance with less than 21.5% of the FLOPs of the state-of-the-art method in both online and offline experiments, showing the superiority of DAGFM in handling industrial-scale data in the CTR prediction task. Our implementation code is available at: https://github.com/RUCAIBox/DAGFM.
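To make the teacher-student setup described in the abstract more concrete, below is a minimal PyTorch sketch of a generic knowledge-distillation training step for a CTR student model. It is not the authors' implementation (see the linked repository for that): the `StudentCTRModel` placeholder, the MSE-on-logits distillation term, and the blending weight `alpha` are all illustrative assumptions.

```python
# Minimal sketch of a teacher-student distillation step for CTR prediction.
# This is NOT the KD-DAGFM implementation; all names and loss weights here
# are illustrative assumptions.
import torch
import torch.nn as nn


class StudentCTRModel(nn.Module):
    """Placeholder for a lightweight student (the paper's student is DAGFM)."""

    def __init__(self, num_fields: int, embed_dim: int):
        super().__init__()
        self.linear = nn.Linear(num_fields * embed_dim, 1)

    def forward(self, field_embeddings: torch.Tensor) -> torch.Tensor:
        # field_embeddings: (batch, num_fields, embed_dim) -> (batch,) logits
        return self.linear(field_embeddings.flatten(1)).squeeze(-1)


def distillation_step(student, teacher, field_embeddings, labels,
                      optimizer, alpha: float = 0.5):
    """One training step: fit the student to the ground-truth labels and to
    the frozen teacher's logits (a common KD objective; the paper may weight
    or formulate these terms differently)."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(field_embeddings)

    student_logits = student(field_embeddings)
    # Standard CTR objective on the true click labels.
    ctr_loss = nn.functional.binary_cross_entropy_with_logits(
        student_logits, labels.float())
    # Distillation objective: match the teacher's predictions.
    kd_loss = nn.functional.mse_loss(student_logits, teacher_logits)
    loss = alpha * ctr_loss + (1.0 - alpha) * kd_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Example usage (teacher is any pretrained CTR model sharing the input contract):
# student = StudentCTRModel(num_fields=39, embed_dim=16)
# optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
# loss = distillation_step(student, teacher, field_embeddings, labels, optimizer)
```

The generic logit-matching loss above is only one common way to couple teacher and student; the abstract's claim of approximately lossless distillation rests on DAGFM's ability to represent the teacher's explicit feature interactions, not on this particular loss formulation.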