Paper Title
Towards a Change Taxonomy for Machine Learning Systems
Paper Authors
Paper Abstract
Machine Learning (ML) research publications commonly provide open-source implementations on GitHub, allowing their audience to replicate, validate, or even extend machine learning algorithms, data sets, and metadata. However, thus far little is known about the degree of collaboration activity happening on such ML research repositories, in particular regarding (1) the degree to which such repositories receive contributions from forks, (2) the nature of such contributions (i.e., the types of changes), and (3) the nature of changes that are not contributed back by forks, which might represent missed opportunities. In this paper, we empirically study contributions to 1,346 ML research repositories and their 67,369 forks, both quantitatively and qualitatively (by building on Hindle et al.'s seminal taxonomy of code changes). We found that while ML research repositories are heavily forked, only 9% of the forks made modifications to the forked repository. Of the latter, 42% sent changes to the parent repositories, half of which (52%) were accepted by the parent repositories. Our qualitative analysis of 539 contributed and 378 local (fork-only) changes extends Hindle et al.'s taxonomy with one new top-level change category related to ML (Data) and 15 new sub-categories, including nine ML-specific ones (input data, output data, program data, sharing, change evaluation, parameter tuning, performance, pre-processing, model training). While the changes that are not contributed back by the forks mostly concern domain-specific customizations and local experimentation (e.g., parameter tuning), the origin ML repositories do miss out on a non-negligible 15.4% of Documentation changes, 13.6% of Feature changes, and 11.4% of Bug fix changes. The findings in this paper will be useful for practitioners, researchers, toolsmiths, and educators.
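A core measurement behind the quantitative results above is whether a fork has diverged from its parent repository. The abstract does not describe the authors' mining scripts, but as a rough illustration of how such a measurement can be made, the Python sketch below pages through a repository's forks with GitHub's REST API and uses the cross-fork compare endpoint to check whether each fork's default branch is ahead of the parent's. The repository name, the GITHUB_TOKEN environment variable, and the "default branch is ahead" criterion are assumptions of this sketch, not details taken from the paper.

```python
# Minimal sketch (not the paper's actual mining pipeline): estimate what share
# of a repository's forks carry local modifications, via the GitHub REST API.
# Assumes a personal access token in the GITHUB_TOKEN environment variable;
# "some-ml-org/some-ml-repo" below is a hypothetical placeholder.
import os
import requests

API = "https://api.github.com"
HEADERS = {
    "Accept": "application/vnd.github+json",
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
}

def list_forks(owner: str, repo: str) -> list[dict]:
    """Page through all forks of owner/repo, 100 per page."""
    forks, page = [], 1
    while True:
        r = requests.get(f"{API}/repos/{owner}/{repo}/forks",
                         headers=HEADERS,
                         params={"per_page": 100, "page": page})
        r.raise_for_status()
        batch = r.json()
        if not batch:
            return forks
        forks.extend(batch)
        page += 1

def fork_is_modified(owner: str, repo: str, base_branch: str, fork: dict) -> bool:
    """Count a fork as modified if its default branch is ahead of the parent's."""
    head = f"{fork['owner']['login']}:{fork['default_branch']}"
    r = requests.get(f"{API}/repos/{owner}/{repo}/compare/{base_branch}...{head}",
                     headers=HEADERS)
    return r.ok and r.json().get("ahead_by", 0) > 0

if __name__ == "__main__":
    owner, repo = "some-ml-org", "some-ml-repo"  # hypothetical example repository
    base = requests.get(f"{API}/repos/{owner}/{repo}",
                        headers=HEADERS).json()["default_branch"]
    forks = list_forks(owner, repo)
    modified = sum(fork_is_modified(owner, repo, base, f) for f in forks)
    print(f"{len(forks)} forks, {modified} with local modifications "
          f"({100 * modified / max(len(forks), 1):.1f}%)")
```

Note that comparing only default branches undercounts forks whose changes live on other branches; a fuller replication would enumerate every branch of each fork before classifying it as unmodified.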