Data Representativity for Machine Learning and AI Systems

Paper Title

Data Representativity for Machine Learning and AI Systems

Authors

Clemmensen, Line H., Kjærsgaard, Rune D.

Abstract

Data representativity is crucial when drawing inference from data through machine learning models. Scholars have increased focus on unraveling the bias and fairness in models, also in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper reviews definitions and notions of a representative sample and surveys their use in scientific AI literature. We introduce three measurable concepts to help focus the notions and evaluate different data samples. Furthermore, we demonstrate that the contrast between a representative sample in the sense of coverage of the input space, versus a representative sample mimicking the distribution of the target population, is of particular relevance when building AI systems. Through empirical demonstrations on US Census data, we evaluate the opposing inherent qualities of these concepts. Finally, we propose a framework of questions for creating and documenting data with data representativity in mind, as an addition to existing dataset documentation templates.
