Paper Title
Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset Evaluation for Text Classification
Paper Authors
Paper Abstract
In this paper, we ask the research question of whether all the datasets in a benchmark are necessary. We approach this by first characterizing the discriminative power of datasets when comparing different systems. Experiments on 9 datasets and 36 systems show that several existing benchmark datasets contribute little to discriminating among top-scoring systems, while some less frequently used datasets exhibit impressive discriminative power. Taking the text classification task as a case study, we further investigate the possibility of predicting a dataset's discriminative power from its properties (e.g., average sentence length). Our preliminary experiments promisingly show that, given a sufficient number of training experimental records, a meaningful predictor can be learned to estimate the discriminative power of unseen datasets. We release all the datasets, together with the features explored in this work, on DataLab: \url{https://datalab.nlpedia.ai}.
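To make the idea of predicting a dataset's discriminative power from surface properties concrete, here is a minimal, hypothetical sketch: it extracts one feature mentioned in the abstract (average sentence length), fits a one-variable least-squares predictor on toy (feature, discrimination) records, and applies it to an unseen dataset. The feature set, the linear model, and all numbers are illustrative assumptions, not the paper's actual method or data.

```python
# Hypothetical sketch of the abstract's idea: learn a predictor that maps
# dataset properties to a "discrimination" score. All names, the linear
# model, and the toy records below are assumptions for illustration only.

def avg_sentence_length(texts):
    """Average number of whitespace tokens per example in a dataset."""
    return sum(len(t.split()) for t in texts) / len(texts)

def fit_linear(xs, ys):
    """Ordinary least squares for a single feature: y ~ a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b

# Toy training records: (average sentence length, observed discrimination).
records = [(8.0, 0.20), (15.0, 0.35), (22.0, 0.50), (30.0, 0.66)]
a, b = fit_linear([x for x, _ in records], [y for _, y in records])

# Estimate discrimination for an unseen dataset from its texts alone.
unseen = ["a short example sentence", "another brief text sample here"]
predicted = a * avg_sentence_length(unseen) + b
```

In the paper's setting, the feature vector would include many more dataset properties and the predictor would be trained on a large pool of experimental records; the single-feature linear fit above only illustrates the shape of the pipeline.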