Paper Title
A Dataset of Enterprise-Driven Open Source Software
Authors
Abstract
We present a dataset of open source software developed mainly by enterprises rather than volunteers. It can be used to address known generalizability concerns and to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account it provides, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.
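The premise of matching committer email domains against enterprise domains can be illustrated with a minimal Python sketch. It does not reproduce the paper's three heuristics, which the abstract does not spell out. The domain lists, the function names (`email_domain`, `is_enterprise_domain`, `insider_commit_share`), and the idea of summarizing a project by its share of insider commits are all assumptions made for illustration.

```python
# Hypothetical lists for illustration only; the dataset's actual
# white- and blacklists are not reproduced here.
ENTERPRISE_DOMAINS = {"example-corp.com"}
BLACKLISTED_DOMAINS = {"gmail.com", "users.noreply.github.com"}


def email_domain(email):
    """Return the lower-cased domain of an email address, or None if malformed."""
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        return None
    return domain.lower()


def is_enterprise_domain(domain):
    """True if the domain is an assumed enterprise domain (or a subdomain of one)."""
    if domain is None or domain in BLACKLISTED_DOMAINS:
        return False
    return any(domain == d or domain.endswith("." + d) for d in ENTERPRISE_DOMAINS)


def insider_commit_share(commit_emails):
    """Fraction of commits whose author email maps to an enterprise domain."""
    if not commit_emails:
        return 0.0
    insiders = sum(1 for e in commit_emails if is_enterprise_domain(email_domain(e)))
    return insiders / len(commit_emails)
```

Under these assumptions, a project could be flagged as a candidate enterprise project when `insider_commit_share` over its commit author emails exceeds some chosen threshold; any such threshold would be a design choice of the sketch, not a value taken from the paper.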