Paper Title
Data Contamination: From Memorization to Exploitation
Paper Authors
Paper Abstract
Pretrained language models are typically trained on massive web-based datasets, which are often "contaminated" with downstream test sets. It is not clear to what extent models exploit the contaminated data for downstream tasks. We present a principled method to study this question. We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task. Comparing performance between samples seen and unseen during pretraining enables us to define and quantify levels of memorization and exploitation. Experiments with two models and three downstream tasks show that exploitation exists in some cases, while in others the models memorize the contaminated data yet do not exploit it. We show that these two measures are affected by different factors, such as the number of duplications of the contaminated data and the model size. Our results highlight the importance of analyzing massive web-scale datasets to verify that progress in NLP is obtained by better language understanding and not better data exploitation.
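The core measurement the abstract describes is a gap statistic: downstream performance on test samples that appeared in the pretraining corpus (seen) versus those that did not (unseen). The sketch below is one plausible formalization of that comparison, assuming accuracy as the metric and list-based inputs; the function name `exploitation_gap` and its arguments are illustrative, not the paper's actual code or exact metric definitions.

```python
# Sketch: quantify exploitation as the performance gap between test samples
# seen during pretraining (contaminated) and unseen (clean) samples.
# Assumptions: classification with accuracy as the metric; the paper's
# precise definitions of memorization/exploitation may differ.

from typing import List


def accuracy(preds: List[int], labels: List[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if not labels:
        return 0.0
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


def exploitation_gap(preds: List[int], labels: List[int], seen: List[bool]) -> float:
    """Accuracy on contaminated (seen) samples minus accuracy on unseen ones.

    A clearly positive gap suggests the model exploits the contaminated data
    downstream; a gap near zero is consistent with memorization without
    exploitation.
    """
    seen_preds = [p for p, s in zip(preds, seen) if s]
    seen_labels = [y for y, s in zip(labels, seen) if s]
    unseen_preds = [p for p, s in zip(preds, seen) if not s]
    unseen_labels = [y for y, s in zip(labels, seen) if not s]
    return accuracy(seen_preds, seen_labels) - accuracy(unseen_preds, unseen_labels)
```

In this framing, memorization and exploitation can diverge: a model may reproduce contaminated examples it saw during pretraining while showing no seen/unseen gap on the fine-tuned task, which is the dissociation the abstract reports.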