Paper Title
Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation
Paper Authors
Paper Abstract
Retrieval and ranking models are the backbone of many applications such as web search, open-domain QA, or text-based recommender systems. The latency of neural ranking models at query time largely depends on the architecture and on deliberate choices by their designers to trade off effectiveness for higher efficiency. This focus on low query latency makes a rising number of efficient ranking architectures feasible for production deployment. In machine learning, an increasingly common approach to closing the effectiveness gap of more efficient models is to apply knowledge distillation from a large teacher model to a smaller student model. We find that different ranking architectures tend to produce output scores of different magnitudes. Based on this finding, we propose a cross-architecture training procedure with a margin-focused loss (Margin-MSE) that adapts knowledge distillation to the varying score output distributions of different BERT and non-BERT passage ranking architectures. We apply the teachable information as additional fine-grained labels to existing training triples of the MSMARCO-Passage collection. We evaluate our procedure of distilling knowledge from state-of-the-art concatenated BERT models into four different efficient architectures (TK, ColBERT, PreTT, and a BERT CLS dot product model). We show that, across the evaluated architectures, our Margin-MSE knowledge distillation significantly improves re-ranking effectiveness without compromising efficiency. Additionally, we show that our general distillation method improves nearest-neighbor-based index retrieval with the BERT dot product model, offering results competitive with specialized and much more costly training methods. To benefit the community, we publish the teacher-score training files in a ready-to-use package.
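The abstract describes the margin-focused loss only at a high level. The sketch below (PyTorch; the function name, tensor layout, and the illustrative usage are our assumptions, not the authors' released implementation) shows the core idea: supervise the student's positive-minus-negative score margin with the teacher's margin rather than the raw scores.

```python
import torch
import torch.nn.functional as F

def margin_mse_loss(student_pos: torch.Tensor,
                    student_neg: torch.Tensor,
                    teacher_pos: torch.Tensor,
                    teacher_neg: torch.Tensor) -> torch.Tensor:
    """Margin-MSE sketch: mean squared error between the student's score
    margin (positive passage minus negative passage) and the teacher's
    margin, so absolute score scales of the two models can differ freely.
    All inputs are 1-D tensors of per-triple relevance scores."""
    return F.mse_loss(student_pos - student_neg,
                      teacher_pos - teacher_neg)

# Illustrative usage with random numbers standing in for model scores;
# the teacher scores on a much larger scale, which Margin-MSE tolerates.
student_pos, student_neg = torch.randn(32), torch.randn(32)
teacher_pos, teacher_neg = 10 * torch.randn(32), 10 * torch.randn(32)
print(margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg).item())
```

Because only the score difference per training triple is supervised, each student architecture can keep whatever output score range it naturally produces, which is the property the abstract motivates with the observation that different ranking architectures score at different magnitudes.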