Paper Title
Understanding Performance of Long-Document Ranking Models through Comprehensive Evaluation and Leaderboarding
Paper Authors
Paper Abstract
We evaluated 20+ Transformer models for ranking of long documents (including recent LongP models trained with FlashAttention) and compared them with a simple FirstP baseline, which applies the same model to the truncated input (at most 512 tokens). We used MS MARCO Documents v1 as a primary training set and evaluated both the zero-shot transferred and fine-tuned models. On MS MARCO, TREC DLs, and Robust04 no long-document model outperformed FirstP by more than 5% in NDCG and MRR (when averaged over all test sets). We conjectured this was not due to models' inability to process long context, but due to a positional bias of relevant passages, whose distribution was skewed towards the beginning of documents. We found direct evidence of this bias in some test sets, which motivated us to create MS MARCO FarRelevant (based on MS MARCO Passages) where the relevant passages were not present among the first 512 tokens. Unlike standard collections where we saw both little benefit from incorporating longer contexts and limited variability in model performance (within a few %), experiments on MS MARCO FarRelevant uncovered dramatic differences among models. The FirstP models performed roughly at the random-baseline level in both zero-shot and fine-tuning scenarios. Simple aggregation models including MaxP and PARADE Attention had good zero-shot accuracy, but benefited little from fine-tuning. Most other models had poor zero-shot performance (sometimes at a random baseline level), but outstripped MaxP by as much as 13-28% after fine-tuning. Thus, the positional bias not only diminishes benefits of processing longer document contexts, but also leads to model overfitting to positional bias and performing poorly in a zero-shot setting when the distribution of relevant passages changes substantially. We make our software and data available.
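The FirstP and MaxP strategies contrasted in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `score_passage` is a hypothetical stand-in for a fine-tuned Transformer cross-encoder (here a simple lexical-overlap scorer so the example is self-contained), and the tokenization is naive whitespace splitting.

```python
# Sketch of the FirstP baseline vs. MaxP aggregation for long-document ranking.
# `score_passage` is a dummy stand-in for a neural relevance model.

def score_passage(query: str, passage: str) -> float:
    """Hypothetical scorer: fraction of passage terms that appear in the query.
    Real systems would use a fine-tuned BERT-style cross-encoder instead."""
    q_terms = set(query.lower().split())
    p_terms = passage.lower().split()
    return sum(t in q_terms for t in p_terms) / max(len(p_terms), 1)

def split_into_passages(doc_tokens: list, max_len: int = 512) -> list:
    """Split a tokenized document into fixed-size chunks (no overlap, for simplicity)."""
    return [" ".join(doc_tokens[i:i + max_len])
            for i in range(0, len(doc_tokens), max_len)]

def firstp_score(query: str, doc_tokens: list) -> float:
    """FirstP: apply the model only to the truncated input (at most 512 tokens)."""
    return score_passage(query, " ".join(doc_tokens[:512]))

def maxp_score(query: str, doc_tokens: list) -> float:
    """MaxP: score every passage independently and keep the maximum score."""
    return max(score_passage(query, p) for p in split_into_passages(doc_tokens))
```

A document whose relevant content starts after token 512, as in the MS MARCO FarRelevant setup described above, gets a zero FirstP score under this toy scorer, while MaxP still surfaces the late relevant passage:

```python
doc = ["filler"] * 600 + ["neural", "ranking", "models"]
print(firstp_score("neural ranking", doc))  # first 512 tokens are all filler
print(maxp_score("neural ranking", doc))    # second passage matches the query
```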