Paper Title

Ambiguous Images With Human Judgments for Robust Visual Event Classification

Paper Authors

Kate Sanders, Reno Kriz, Anqi Liu, Benjamin Van Durme

Abstract

Contemporary vision benchmarks predominantly consider tasks on which humans can achieve near-perfect performance. However, humans are frequently presented with visual data that they cannot classify with 100% certainty, and models trained on standard vision benchmarks achieve low performance when evaluated on this data. To address this issue, we introduce a procedure for creating datasets of ambiguous images and use it to produce SQUID-E ("Squidy"), a collection of noisy images extracted from videos. All images are annotated with ground truth values and a test set is annotated with human uncertainty judgments. We use this dataset to characterize human uncertainty in vision tasks and evaluate existing visual event classification models. Experimental results suggest that existing vision models are not sufficiently equipped to provide meaningful outputs for ambiguous images and that datasets of this nature can be used to assess and improve such models through model training and direct evaluation of model calibration. These findings motivate large-scale ambiguous dataset creation and further research focusing on noisy visual data.
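The abstract mentions direct evaluation of model calibration against human uncertainty judgments. One standard way to quantify calibration (a common choice, not necessarily the paper's exact metric) is expected calibration error (ECE): bin predictions by confidence and average the gap between confidence and accuracy in each bin. The sketch below is illustrative; the function name, binning scheme, and toy data are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE sketch: bin predictions by confidence, then take the
    bin-size-weighted average of |mean confidence - accuracy| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece

# Toy example: a model that says 80% and is right 4 times out of 5
# is perfectly calibrated, so ECE is 0.
print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))  # → 0.0
```

On an ambiguous-image test set like SQUID-E, the same machinery could compare model confidences against human uncertainty judgments instead of binary correctness, which is one reading of the "direct evaluation of model calibration" the abstract describes.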
