Paper title
A domain-specific language for describing machine learning datasets
Paper authors
Abstract
Datasets play a central role in the training and evaluation of machine learning (ML) models, but they are also the root cause of many undesired model behaviors, such as biased predictions. To address this, the ML community is proposing a data-centric cultural shift in which data issues receive the attention they deserve and more standard practices for gathering and processing datasets are discussed and established. So far, these proposals are mostly high-level guidelines described in natural language and, as such, are difficult to formalize and apply to particular datasets. Inspired by these proposals, we define a new domain-specific language (DSL) to precisely describe machine learning datasets in terms of their structure, data provenance, and social concerns. We believe this DSL will help any ML initiative leverage and benefit from this data-centric shift (e.g., by selecting the most appropriate dataset for a new project or by better replicating other ML results). The DSL is implemented as a Visual Studio Code plugin and has been published under an open source license.