Paper Title
From Small Scales to Large Scales: Distance-to-Measure Density based Geometric Analysis of Complex Data
Authors
Abstract
How can we tell apart complex point clouds with different small-scale characteristics while disregarding global features? Can we find a suitable transformation of such data that allows us to discriminate between differences in this sense with statistical guarantees? In this paper, we consider the analysis and classification of complex point clouds as they are obtained, e.g., via single molecule localization microscopy. We focus on the task of identifying differences between noisy point clouds based on small-scale characteristics, while disregarding large-scale information such as overall size. We propose an approach based on a transformation of the data via the so-called Distance-to-Measure (DTM) function, a transformation that is based on the average of nearest neighbor distances. For each data set, we estimate the probability density of the average local distances of all data points and use the estimated densities for classification. While the applicability is immediate and the practical performance of the proposed methodology is very good, the theoretical study of the density estimators is quite challenging, as they are based on i.i.d. observations that have been obtained via a complicated transformation. In fact, the transformed data are stochastically dependent in a non-local way that is not captured by commonly considered dependence measures. Nonetheless, we show that the asymptotic behaviour of the density estimator is driven by a kernel density estimator of certain i.i.d. random variables. The proof uses theoretical properties of U-statistics, which allow us to handle the dependencies via a Hoeffding decomposition. We show via a numerical study and in an application to simulated single molecule localization microscopy data of chromatin fibers that unsupervised classification tasks based on estimated DTM-densities achieve excellent separation results.
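The pipeline described in the abstract (DTM transformation, then a density estimate of the transformed values) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the empirical DTM of a point is taken as the square root of the mean squared distance to its `k` nearest sample points (a common empirical form; the paper's exact definition may differ), the density is a plain Gaussian kernel estimate, and the function names `dtm_values` and `gaussian_kde_1d`, the bandwidth, and the L1 comparison at the end are hypothetical choices made only for illustration.

```python
import numpy as np


def dtm_values(points, k):
    """Empirical DTM of every sample point with respect to the cloud itself.

    Assumption: DTM(x)^2 is the mean of the squared distances from x to its
    k nearest sample points (x counts as its own neighbour at distance 0).
    """
    points = np.asarray(points, dtype=float)
    # Pairwise squared Euclidean distances, shape (n, n).
    diff = points[:, None, :] - points[None, :, :]
    sq_dists = np.einsum("ijk,ijk->ij", diff, diff)
    # The first k columns after partitioning hold the k smallest values per row.
    nearest = np.partition(sq_dists, k - 1, axis=1)[:, :k]
    return np.sqrt(nearest.mean(axis=1))


def gaussian_kde_1d(samples, grid, bandwidth):
    """Plain Gaussian kernel density estimate evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


# Illustrative comparison of two synthetic clouds (not data from the paper).
rng = np.random.default_rng(0)
cloud_a = rng.uniform(size=(300, 2))                      # spatially uniform
centers = rng.uniform(size=(30, 2))
cloud_b = np.repeat(centers, 10, axis=0) + rng.normal(scale=0.01, size=(300, 2))

grid = np.linspace(0.0, 0.3, 200)                         # shared evaluation grid
dens_a = gaussian_kde_1d(dtm_values(cloud_a, k=10), grid, bandwidth=0.01)
dens_b = gaussian_kde_1d(dtm_values(cloud_b, k=10), grid, bandwidth=0.01)

# Hypothetical dissimilarity: L1 distance between the two estimated densities.
l1_distance = np.abs(dens_a - dens_b).sum() * (grid[1] - grid[0])
```

With `n` points, the choice of `k` corresponds to a mass parameter of roughly `k / n`, so a small `k` makes the DTM values sensitive to local structure only. A matrix of such pairwise dissimilarities between data sets could then be passed to any standard clustering routine. The quadratic-memory distance matrix is acceptable for a few thousand points, and a spatial index would be preferable beyond that.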