Paper title
Fast Generation of Exchangeable Sequence of Clusters Data
Paper authors
Paper abstract
Recent advances in Bayesian models for random partitions have led to the formulation and exploration of Exchangeable Sequences of Clusters (ESC) models. Under ESC models, it is the cluster sizes that are exchangeable, rather than the observations themselves. This property is particularly useful for obtaining microclustering behavior, whereby cluster sizes grow sublinearly in the number of observations, as is common in applications such as record linkage, sparse networks and genomics. Unfortunately, the exchangeable clusters property comes at the cost of projectivity. As a consequence, in contrast to more traditional Dirichlet Process or Pitman-Yor process mixture models, samples a priori from ESC models cannot be easily obtained in a sequential fashion and instead require the use of rejection or importance sampling. In this work, drawing on connections between ESC models and discrete renewal theory, we obtain closed-form expressions for certain ESC models and develop faster methods for generating samples a priori from these models compared with the existing state of the art. In the process, we establish analytical expressions for the distribution of the number of clusters under ESC models, which was unknown prior to this work.
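To make the contrast between rejection sampling and a renewal-theory approach concrete, here is a minimal Python sketch. It is not the authors' algorithm; it is an illustration under stated assumptions. It assumes a fixed, known cluster-size distribution `mu` on the positive integers (passed as a dict), and it samples only the ordered cluster sizes, not the assignment of observations to clusters. The function names `renewal_sequence`, `sample_sizes_rejection`, and `sample_sizes_renewal` are hypothetical. The rejection-free sampler relies on a standard fact from discrete renewal theory: if u(m) is the probability that the partial sums of i.i.d. draws from `mu` hit m exactly, then, conditional on hitting m, the first draw equals k with probability mu(k) u(m - k) / u(m).

```python
import random


def renewal_sequence(mu, n):
    """Return u[0..n], where u[m] = P(partial sums of i.i.d. draws from mu hit m).

    Uses the renewal recursion u[m] = sum_k mu(k) * u[m - k], with u[0] = 1.
    """
    u = [1.0] + [0.0] * n
    for m in range(1, n + 1):
        u[m] = sum(mu.get(k, 0.0) * u[m - k] for k in range(1, m + 1))
    return u


def sample_sizes_rejection(mu, n, rng):
    """Baseline: draw sizes until the total reaches n; start over on overshoot."""
    ks, ws = zip(*sorted(mu.items()))
    while True:
        sizes, total = [], 0
        while total < n:
            k = rng.choices(ks, weights=ws)[0]
            sizes.append(k)
            total += k
        if total == n:  # accept only if the sizes sum to exactly n
            return sizes


def sample_sizes_renewal(mu, n, rng):
    """Rejection-free: draw each size from mu(k) * u[m - k] / u[m], m = remaining."""
    u = renewal_sequence(mu, n)
    if u[n] == 0.0:
        raise ValueError("no sequence of sizes supported by mu sums to n")
    sizes, m = [], n
    while m > 0:
        ks = list(range(1, m + 1))
        ws = [mu.get(k, 0.0) * u[m - k] for k in ks]  # unnormalized weights
        k = rng.choices(ks, weights=ws)[0]
        sizes.append(k)
        m -= k
    return sizes


if __name__ == "__main__":
    rng = random.Random(0)
    mu = {1: 0.5, 2: 0.3, 3: 0.2}  # hypothetical size distribution
    print(sample_sizes_rejection(mu, 20, rng))
    print(sample_sizes_renewal(mu, 20, rng))
```

Both samplers target the same conditional law of the ordered sizes given that they sum to n. The baseline accepts each attempt with probability u(n), whereas the second sampler never rejects, at the price of computing the renewal sequence once (quadratic in n in this naive form).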