Paper title
The Software Heritage Graph Dataset: Large-scale Analysis of Public Software Development History
Paper authors
Abstract
Software Heritage is the largest existing public archive of software source code and accompanying development history. It spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. These software artifacts were retrieved from major collaborative development platforms (e.g., GitHub, GitLab) and package repositories (e.g., PyPI, Debian, NPM), and stored in a uniform representation linking together source code files, directories, commits, and full snapshots of version control system (VCS) repositories as observed by Software Heritage during periodic crawls. This dataset is unique in terms of accessibility and scale, and makes it possible to explore a number of research questions on the long tail of public software development, instead of focusing solely on "most starred" repositories, as is often the case.