Title
Toward Long-Term and Archivable Reproducibility
Authors
Abstract
Analysis pipelines commonly use high-level technologies that are popular when created, but are unlikely to be readable, executable, or sustainable in the long term. A set of criteria is introduced to address this problem: Completeness (no execution requirement beyond a minimal Unix-like operating system, no administrator privileges, no network connection, and storage primarily in plain text); modular design; minimal complexity; scalability; verifiable inputs and outputs; version control; linking analysis with narrative; and free and open source software. As a proof of concept, we introduce "Maneage" (Managing data lineage), enabling cheap archiving, provenance extraction, and peer verification that has been tested in several research publications. We show that longevity is a realistic requirement that does not sacrifice immediate or short-term reproducibility. The caveats (with proposed solutions) are then discussed and we conclude with the benefits for the various stakeholders. This article is itself a Maneage'd project (project commit 54e4eb2).