Paper Title
Relationship-aware Multivariate Sampling Strategy for Scientific Simulation Data
Paper Authors
Paper Abstract
With the increasing computational power of current supercomputers, the size of the data produced by scientific simulations is growing rapidly. To reduce the storage footprint and facilitate scalable post-hoc analyses of such scientific data sets, various data reduction/summarization methods have been proposed over the years. Several kinds of sampling algorithms exist that subsample high-resolution scientific data while preserving the data properties required for subsequent analyses. However, most of these sampling algorithms are designed for univariate data and cater to post-hoc analyses of single variables. In this work, we propose a multivariate sampling strategy that preserves the original variable relationships and enables different multivariate analyses directly on the sampled data. Our proposed strategy uses principal component analysis to capture the variance of the multivariate data and can be built on top of any existing state-of-the-art sampling algorithm for single variables. In addition, we propose variants based on different data partitioning schemes (regular and irregular) to efficiently model local multivariate relationships. Using two real-world multivariate data sets, we demonstrate the efficacy of the proposed multivariate sampling strategy with respect to its data reduction capabilities and the ease of performing efficient post-hoc multivariate analyses.
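The abstract gives no algorithmic detail, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that the multivariate data can be reduced to scores along the first principal component and that a univariate sampler is then applied to those scores. The function names (`first_pc_scores`, `univariate_sample`, `multivariate_sample`) and the rarity-weighted histogram sampler are hypothetical stand-ins chosen for illustration, and the data partitioning variants mentioned in the abstract are not modeled.

```python
import numpy as np


def first_pc_scores(data):
    """Project each point (row of `data`) onto the first principal component."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]


def univariate_sample(values, fraction, bins=32, seed=0):
    """Hypothetical stand-in for an existing univariate sampler.

    Points are drawn without replacement, with probability inversely
    proportional to the population of their histogram bin, so that
    rare values are more likely to be kept.
    """
    rng = np.random.default_rng(seed)
    edges = np.histogram_bin_edges(values, bins=bins)
    bin_of = np.digitize(values, edges[1:-1])       # bin index in 0..bins-1
    counts = np.bincount(bin_of, minlength=bins)    # points per bin
    weights = 1.0 / counts[bin_of]                  # occupied bins only
    n_keep = max(1, int(round(fraction * values.size)))
    return rng.choice(values.size, size=n_keep, replace=False,
                      p=weights / weights.sum())


def multivariate_sample(data, fraction, seed=0):
    """Assumed combination: sample on PC1 scores, keep all variables."""
    scores = first_pc_scores(data)
    keep = np.sort(univariate_sample(scores, fraction, seed=seed))
    return keep, data[keep]


# Usage on synthetic data: two strongly correlated variables.
rng = np.random.default_rng(42)
x = rng.normal(size=5000)
demo = np.column_stack([x, 2.0 * x + 0.1 * rng.normal(size=5000)])
kept_idx, sampled = multivariate_sample(demo, fraction=0.05)
print("kept points:", sampled.shape[0])
print("correlation (full):   ", np.corrcoef(demo.T)[0, 1])
print("correlation (sampled):", np.corrcoef(sampled.T)[0, 1])
```

Because every kept point retains all of its variables, correlations and other multivariate statistics can be computed directly on the sample. How well they match the full data depends on the sampler and the sampling fraction, which this sketch does not evaluate.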