An Improved and Parallel Version of a Scalable Algorithm for Analyzing Time Series Data

Paper title

An Improved and Parallel Version of a Scalable Algorithm for Analyzing Time Series Data

Author

Vitalis, Andreas

Abstract

Today, very large amounts of data are produced and stored in all branches of society including science. Mining these data meaningfully has become a considerable challenge and is of the broadest possible interest. The size, both in numbers of observations and dimensionality thereof, requires data mining algorithms to possess time complexities with both variables that are linear or nearly linear. One such algorithm, see Comput. Phys. Commun. 184, 2446-2453 (2013), arranges observations into a sequence called the progress index. The progress index steps through distinct regions of high sampling density sequentially. By means of suitable annotations, it allows a compact representation of the behavior of complex systems, which is encoded in the original data set. The only essential parameter is a notion of distance between observations. Here, we present the shared memory parallelization of the key step in constructing the progress index, which is the calculation of an approximation of the minimum spanning tree of the complete graph of observations. We demonstrate that excellent parallel efficiencies are obtained for up to 72 logical (CPU) cores. In addition, we introduce three conceptual advances to the algorithm that improve its controllability and the interpretability of the progress index itself.
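The abstract ties the progress index to a minimum spanning tree of the complete graph of observations. The sketch below illustrates that connection on a toy data set. It assumes the ordering is grown Prim-style, that is, each step appends the unvisited observation closest to any observation already in the sequence; the abstract does not spell this out, so treat it as an assumption. The names `progress_index` and `euclidean` are hypothetical, the computation is exact and serial with roughly quadratic cost, and it is not the approximate, parallel scheme the paper presents.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance; any other notion of distance can be swapped in."""
    return math.dist(a, b)


def progress_index(points, start=0, distance=euclidean):
    """Return observation indices in Prim-style growth order from `start`.

    Assumption: the next observation added is always the unvisited one
    with the smallest distance to any observation already in the sequence.
    """
    n = len(points)
    visited = [False] * n
    order = []
    # Heap entries are (distance to the visited set via one edge, index).
    heap = [(0.0, start)]
    while heap and len(order) < n:
        _, i = heapq.heappop(heap)
        if visited[i]:
            continue  # stale entry: i was already reached via a shorter edge
        visited[i] = True
        order.append(i)
        # Offer every edge from the newly added observation to the rest.
        for j in range(n):
            if not visited[j]:
                heapq.heappush(heap, (distance(points[i], points[j]), j))
    return order


# Two well-separated groups, deliberately interleaved in input order.
toy = [(0, 0), (1, 0), (10, 10), (2, 0), (11, 10)]
print(progress_index(toy))
```

On this toy input the ordering exhausts the group around the origin before crossing to the second group, which matches the abstract's description of stepping through distinct regions of high sampling density one after another.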
