Paper Title

Empirical Study on the Software Engineering Practices in Open Source ML Package Repositories

Authors

Minke Xiu, Ellis E. Eghan, Zhen Ming Jiang, Bram Adams

Abstract

Recent advances in Artificial Intelligence (AI), especially in Machine Learning (ML), have introduced various practical applications (e.g., virtual personal assistants and autonomous cars) that enhance the experience of everyday users. However, modern ML technologies like Deep Learning require considerable technical expertise and resources to develop, train and deploy such models, making effective reuse of the ML models a necessity. Such discovery and reuse by practitioners and researchers are being addressed by public ML package repositories, which bundle up pre-trained models into packages for publication. Since such repositories are a recent phenomenon, there is no empirical data on their current state and challenges. Hence, this paper conducts an exploratory study that analyzes the structure and contents of two popular ML package repositories, TFHub and PyTorch Hub, comparing their information elements (features and policies), package organization, package manager functionalities and usage contexts against popular software package repositories (npm, PyPI, and CRAN). Through these studies, we have identified unique SE practices and challenges for sharing ML packages. These findings and implications would be useful for data scientists, researchers and software developers who intend to use these shared ML packages.
