Paper Title

Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching

Paper Authors

Liu Yang, Mingyang Zhang, Cheng Li, Michael Bendersky, Marc Najork

Paper Abstract

Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has been largely focused on matching between short texts (e.g., question answering), or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications like news recommendation, related article recommendation and document clustering, is relatively less explored and needs more research effort. In recent years, self-attention-based models like Transformers and BERT have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length. In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models for longer text input. In order to better capture sentence-level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT. Our experimental results on several benchmark datasets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models including hierarchical attention, multi-depth attention-based hierarchical recurrent neural network, and BERT. Compared to BERT-based baselines, our model is able to increase the maximum input text length from 512 to 2048. We will open source a Wikipedia-based benchmark dataset, code and a pre-trained checkpoint to accelerate future research on long-form document matching.
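To make the hierarchical, Siamese encoding idea described in the abstract concrete, the sketch below splits a document into fixed-length sentence blocks, encodes each block with a small block-level Transformer, feeds the resulting block vectors into a document-level Transformer, and scores a document pair by the cosine similarity of the two document embeddings. This is a minimal illustration in PyTorch under stated assumptions, not the authors' released SMITH implementation: the class name SmithLikeEncoder, all layer sizes, the mean-pooling steps, and the toy random inputs are hypothetical choices made only for exposition.

```python
# Minimal sketch of a two-level (hierarchical) Siamese document encoder.
# Assumption: documents are already tokenized, padded and pre-split into
# fixed-length sentence blocks of shape (batch, num_blocks, block_len).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmithLikeEncoder(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, nhead=4,
                 block_len=32, max_blocks=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.tok_pos = nn.Embedding(block_len, d_model)   # token positions within a block
        self.blk_pos = nn.Embedding(max_blocks, d_model)  # block positions within a document
        # Block-level Transformer: self-attention only sees block_len tokens at a time.
        block_layer = nn.TransformerEncoderLayer(d_model, nhead, 4 * d_model, batch_first=True)
        self.block_encoder = nn.TransformerEncoder(block_layer, num_layers=2)
        # Document-level Transformer: attends over one vector per block.
        doc_layer = nn.TransformerEncoderLayer(d_model, nhead, 4 * d_model, batch_first=True)
        self.doc_encoder = nn.TransformerEncoder(doc_layer, num_layers=2)

    def forward(self, token_ids):
        # token_ids: (batch, num_blocks, block_len)
        b, n_blocks, blk_len = token_ids.shape
        tok_positions = torch.arange(blk_len, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.tok_pos(tok_positions)
        # Encode every block independently with the block-level Transformer.
        x = x.reshape(b * n_blocks, blk_len, -1)
        x = self.block_encoder(x)
        block_vecs = x.mean(dim=1).reshape(b, n_blocks, -1)  # one vector per block
        # Add block-position embeddings, then run the document-level Transformer.
        blk_positions = torch.arange(n_blocks, device=token_ids.device)
        block_vecs = block_vecs + self.blk_pos(blk_positions)
        doc = self.doc_encoder(block_vecs).mean(dim=1)        # one vector per document
        return F.normalize(doc, dim=-1)


# Siamese matching: the same encoder embeds both documents, and the cosine
# similarity of the two document vectors serves as the matching score.
encoder = SmithLikeEncoder()
doc_a = torch.randint(0, 30522, (1, 8, 32))  # 8 blocks of 32 tokens (toy input)
doc_b = torch.randint(0, 30522, (1, 8, 32))
score = (encoder(doc_a) * encoder(doc_b)).sum(dim=-1)
print(score.item())
```

Because self-attention runs only within each short block and then over the much shorter sequence of block vectors, the cost grows roughly linearly with the number of blocks instead of quadratically with the total token count, which is the property that lets this style of encoder accept longer inputs (e.g., 2048 tokens) than a flat BERT encoder capped at 512.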
