Paper Title

Common Failure Modes of Subcluster-based Sampling in Dirichlet Process Gaussian Mixture Models -- and a Deep-learning Solution

Paper Authors

Vlad Winter, Or Dinari, Oren Freifeld

Paper Abstract

The Dirichlet Process Gaussian Mixture Model (DPGMM) is often used to cluster data when the number of clusters is unknown. One main DPGMM inference paradigm relies on sampling. Here we consider a known state-of-the-art sampler (proposed by Chang and Fisher III (2013) and improved by Dinari et al. (2019)), analyze its failure modes, and show how to improve it, often drastically. Concretely, in that sampler, whenever a new cluster is formed it is augmented with two subclusters whose labels are initialized at random. As they evolve, the subclusters serve to propose a split of the parent cluster. We show that the random initialization is often problematic and hurts the otherwise-effective sampler. Specifically, we demonstrate that this initialization tends to lead to poor split proposals and/or too many iterations before a desired split is accepted. This slows convergence and can damage the clustering. As a remedy, we propose two drop-in replacement options for the subcluster-initialization subroutine. The first is an intuitive heuristic while the second is based on deep learning. We show that the proposed approach yields better splits, which in turn translate to substantial improvements in performance, results, and stability.
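The abstract does not spell out the replacement heuristic, so the sketch below is only a rough illustration of the idea it describes: when a cluster is split-augmented, its two subcluster labels can be seeded randomly (the baseline the paper criticizes) or with some structure-aware initializer. Here a short 2-means run stands in as a hypothetical initializer; the function names and the choice of k-means are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): random vs. heuristic
# subcluster-label initialization for a split proposal in a
# subcluster-based DPGMM sampler.
import numpy as np

def init_subclusters_random(points, rng):
    """Baseline: assign each point to one of two subclusters at random."""
    return rng.integers(0, 2, size=len(points))

def init_subclusters_heuristic(points, rng, n_iters=10):
    """Hypothetical heuristic: a short 2-means run so the initial
    subclusters already reflect structure within the parent cluster."""
    # Seed the two subcluster centers with two distinct random points.
    idx = rng.choice(len(points), size=2, replace=False)
    centers = points[idx].astype(float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iters):
        # Assign each point to its nearest subcluster center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a subcluster empties.
        for k in range(2):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two well-separated Gaussian blobs merged into one "parent" cluster.
    parent = np.vstack([
        rng.normal(loc=-3.0, scale=0.5, size=(200, 2)),
        rng.normal(loc=+3.0, scale=0.5, size=(200, 2)),
    ])
    random_labels = init_subclusters_random(parent, rng)
    heuristic_labels = init_subclusters_heuristic(parent, rng)
    # Fraction of the first blob's points landing in a single subcluster:
    # ~0.5 for random labels, ~1.0 for the heuristic.
    print("random    agreement:", max((random_labels[:200] == 0).mean(),
                                      (random_labels[:200] == 1).mean()))
    print("heuristic agreement:", max((heuristic_labels[:200] == 0).mean(),
                                      (heuristic_labels[:200] == 1).mean()))
```

On data like the two blobs above, the heuristic's initial subclusters already separate the components, so the very first split proposal is a good one; random labels mix the blobs and, as the abstract argues, tend to yield poor proposals or many wasted iterations before a desired split is accepted.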
