Paper Title

Data and its (dis)contents: A survey of dataset development and use in machine learning research

Authors

Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, Alex Hanna

Abstract

Datasets have played a foundational role in the advancement of machine learning research. They form the basis for the models we design and deploy, as well as our primary medium for benchmarking and evaluation. Furthermore, the ways in which we collect, construct and share these datasets inform the kinds of problems the field pursues and the methods explored in algorithm development. However, recent work from a breadth of perspectives has revealed the limitations of predominant practices in dataset collection and use. In this paper, we survey the many concerns raised about the way we collect and use data in machine learning and advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.
