Testing for Typicality with Respect to an Ensemble of Learned Distributions

Paper title

Testing for Typicality with Respect to an Ensemble of Learned Distributions

Authors

Laine, Forrest; Tomlin, Claire

Abstract

Methods of performing anomaly detection on high-dimensional data sets are needed, since algorithms that are trained on data are only expected to perform well on data that is similar to the training data. There are theoretical results on the ability to detect whether a population of data is likely to come from a known base distribution, which is known as the goodness-of-fit problem. One-sample approaches to this problem offer significant computational advantages for online testing, but require knowing a model of the base distribution. The ability to correctly reject anomalous data in this setting hinges on the accuracy of the model of the base distribution. For high-dimensional data, learning a model of the base distribution that is accurate enough for anomaly detection to work reliably is very challenging, as many researchers have noted in recent years. Existing methods for the one-sample goodness-of-fit problem do not account for the fact that a model of the base distribution is learned. To address that gap, we offer a theoretically motivated approach to account for the density learning procedure. In particular, we propose training an ensemble of density models, considering data to be anomalous if the data is anomalous with respect to any member of the ensemble. We provide a theoretical justification for this approach, proving first that a test on typicality is a valid approach to the goodness-of-fit problem, and then proving that for a correctly constructed ensemble of models, the intersection of typical sets of the models lies in the interior of the typical set of the base distribution. We present our method in the context of an example on synthetic data in which the effects we consider can easily be seen.
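
The decision rule described in the abstract (flag a batch as anomalous if it is atypical with respect to any ensemble member) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the density models are full-covariance Gaussians, the ensemble is built from bootstrap resamples, typicality is checked by comparing the batch's mean negative log-likelihood with the model's entropy, and the function names and the tolerance `eps` are hypothetical.

```python
import numpy as np

def fit_gaussian(x):
    """Fit a full-covariance Gaussian (stand-in density model); returns (mean, cov)."""
    mean = x.mean(axis=0)
    # Small ridge keeps the covariance invertible (assumed regularizer).
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    return mean, cov

def neg_log_likelihood(x, mean, cov):
    """Per-sample negative log-likelihood of x under N(mean, cov)."""
    d = x.shape[1]
    diff = x - mean
    solved = np.linalg.solve(cov, diff.T).T
    maha = np.einsum("ij,ij->i", diff, solved)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (maha + logdet + d * np.log(2 * np.pi))

def gaussian_entropy(cov):
    """Differential entropy of a Gaussian with covariance cov."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (logdet + d * (1 + np.log(2 * np.pi)))

def fit_ensemble(train, n_models=5, seed=0):
    """Fit one model per bootstrap resample (assumed ensemble construction)."""
    rng = np.random.default_rng(seed)
    n = train.shape[0]
    return [fit_gaussian(train[rng.integers(0, n, size=n)]) for _ in range(n_models)]

def is_anomalous(batch, models, eps):
    """Flag the batch if it is atypical under ANY ensemble member.

    A batch is treated as typical for a model when its mean negative
    log-likelihood is within eps of that model's entropy.
    """
    for mean, cov in models:
        gap = abs(neg_log_likelihood(batch, mean, cov).mean() - gaussian_entropy(cov))
        if gap > eps:
            return True
    return False
```

Under these assumptions, a batch drawn from the training distribution should pass, while a shifted batch fails. A batch concentrated at the mode also fails even though each point has high likelihood, which is the distinction between a typicality test and a plain likelihood threshold. In practice `eps` would have to be calibrated, for example on held-out data.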
