A Case Study on Parallel HDF5 Dataset Concatenation for High Energy Physics Data Analysis

Paper Title

A Case Study on Parallel HDF5 Dataset Concatenation for High Energy Physics Data Analysis

Authors

Sunwoo Lee, Kai-yuan Hou, Kewei Wang, Saba Sehrish, Marc Paterno, James Kowalkowski, Quincey Koziol, Robert Ross, Ankit Agrawal, Alok Choudhary, Wei-keng Liao

Abstract

In High Energy Physics (HEP), experimentalists generate large volumes of data that, when analyzed, help us better understand the fundamental particles and their interactions. This data is often captured in many small files, creating a data management challenge for scientists. To better facilitate data management, transfer, and analysis on large-scale platforms, it is advantageous to aggregate the data further into a smaller number of larger files. However, this translation process can consume significant time and resources, and if performed incorrectly, the resulting aggregated files can be inefficient for highly parallel access during analysis on large-scale platforms. In this paper, we present our case study on parallel I/O strategies and HDF5 features for reducing data aggregation time, making effective use of compression, and ensuring efficient access to the resulting data during analysis at scale. In this case study, we focus on data from the NOvA detector, a large-scale HEP experiment that generates many terabytes of data. The lessons learned from our case study inform the handling of similar datasets, thus expanding community knowledge related to this common data management task.
