Paper title
Astronomical data organization, management and access in Scientific Data Lakes
Paper authors
Paper abstract
The volume of data stored in telescope archives is constantly increasing due to the development of and improvements in instrumentation. Often the archives need to be stored on a distributed storage architecture provided by independent compute centres. Such a distributed data archive requires overarching data management orchestration. This orchestration comprises tools that handle data storage and cataloguing and that steer transfers across different storage systems and protocols, while being aware of data policies and locality. In addition, it needs a common Authorisation and Authentication Infrastructure (AAI) layer which is perceived as a single entity by end users and provides transparent data access. The scientific domain of particle physics also uses complex and distributed data management systems. The experiments at the Large Hadron Collider (LHC) accelerator at CERN generate several hundred petabytes of data per year. This data is distributed globally to partner sites and users using national compute facilities. Several innovative tools were developed to successfully address the distributed computing challenges in the context of the Worldwide LHC Computing Grid (WLCG). The work being carried out in the ESCAPE project, in the Data Infrastructure for Open Science (DIOS) work package, is to prototype a Scientific Data Lake using the tools developed in the context of the WLCG, bringing together different physics disciplines and addressing FAIR standards and Open Data. We present how the Scientific Data Lake prototype is applied to address astronomical data use cases. We introduce the software stack and also discuss some of the differences between the domains.