Paper Title

Hardness of Samples Need to be Quantified for a Reliable Evaluation System: Exploring Potential Opportunities with a New Task

Authors

Swaroop Mishra, Anjana Arunkumar, Chris Bryan, Chitta Baral

Abstract

Evaluation of models on benchmarks is unreliable without knowing the degree of sample hardness; this subsequently overestimates the capability of AI systems and limits their adoption in real-world applications. We propose a Data Scoring task that requires assigning each unannotated sample in a benchmark a score between 0 and 1, where 0 signifies easy and 1 signifies hard. The use of unannotated samples in our task design is inspired by humans, who can determine a question's difficulty without knowing its correct answer. This also rules out the use of methods involving model-based supervision (since they require sample annotations to be trained), eliminating potential biases associated with models in deciding sample difficulty. We propose a method based on Semantic Textual Similarity (STS) for this task; we validate our method by showing that existing models are more accurate on the easier sample-chunks than on the harder sample-chunks. Finally, we demonstrate five novel applications.
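
The abstract only names Semantic Textual Similarity as the basis of the scoring method; the sketch below is not the authors' algorithm but a minimal illustration of how an STS signal could be turned into a hardness score in [0, 1] for unannotated samples. The function name hardness_scores, the reference_texts argument, the choice of the all-MiniLM-L6-v2 model, and the mapping from similarity to hardness are all illustrative assumptions, built on the sentence-transformers library.

```python
# Hypothetical sketch: mapping an STS signal to a [0, 1] hardness score.
# This is NOT the paper's method; it only illustrates the general idea of
# scoring unannotated samples without any label- or model-based supervision.
from sentence_transformers import SentenceTransformer, util


def hardness_scores(samples, reference_texts, model_name="all-MiniLM-L6-v2"):
    """Assign each unannotated sample a score in [0, 1] (0 = easy, 1 = hard).

    Heuristic assumption: a sample that is semantically close to the given
    reference texts (e.g., other benchmark samples or task instructions) is
    treated as easier; lower similarity maps to a higher hardness score.
    """
    model = SentenceTransformer(model_name)
    sample_emb = model.encode(samples, convert_to_tensor=True)
    ref_emb = model.encode(reference_texts, convert_to_tensor=True)

    # Cosine similarity of each sample to its most similar reference text.
    sims = util.cos_sim(sample_emb, ref_emb).max(dim=1).values  # in [-1, 1]

    # Rescale similarity to [0, 1] and invert: high similarity -> low hardness.
    hardness = 1.0 - (sims + 1.0) / 2.0
    return hardness.tolist()


if __name__ == "__main__":
    samples = [
        "What is the capital of France?",
        "Prove that every bounded monotone sequence converges.",
    ]
    references = ["Name the capital city of a European country."]
    print(hardness_scores(samples, references))
```

Note that, consistent with the task design described above, this kind of scoring uses no sample annotations and no trained supervision signal; only the raw sample text and reference text enter the computation.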
