Bugs in the Data: How ImageNet Misrepresents Biodiversity

Paper Title

Bugs in the Data: How ImageNet Misrepresents Biodiversity

Authors

Alexandra Sasha Luccioni, David Rolnick

Abstract

ImageNet-1k is a dataset often used for benchmarking machine learning (ML) models and evaluating tasks such as image recognition and object detection. Wild animals make up 27% of ImageNet-1k but, unlike classes representing people and objects, these data have not been closely scrutinized. In the current paper, we analyze the 13,450 images from 269 classes that represent wild animals in the ImageNet-1k validation set, with the participation of expert ecologists. We find that many of the classes are ill-defined or overlapping, and that 12% of the images are incorrectly labeled, with some classes having >90% of images incorrect. We also find that both the wildlife-related labels and images included in ImageNet-1k present significant geographical and cultural biases, as well as ambiguities such as artificial animals, multiple species in the same image, or the presence of humans. Our findings highlight serious issues with the extensive use of this dataset for evaluating ML systems, the use of such algorithms in wildlife-related tasks, and more broadly the ways in which ML datasets are commonly created and curated.
