Paper title
Bayesian nonparametric modelling of sequential discoveries
Paper authors
Paper abstract
We aim to model the appearance of distinct tags in a sequence of labelled objects. Common examples of this type of data include words in a corpus or distinct species in a sample. These sequential discoveries are often summarised via accumulation curves, which count the number of distinct entities observed in an increasingly large set of objects. We propose a novel Bayesian nonparametric method for species sampling modelling by directly specifying the probability of a new discovery, thereby allowing for flexible specifications. The asymptotic behaviour and finite-sample properties of such an approach are extensively studied. Interestingly, our enlarged class of sequential processes includes highly tractable special cases. We present a subclass of models characterised by appealing theoretical and computational properties. Moreover, due to strong connections with logistic regression models, the latter subclass can naturally account for covariates. We finally test our proposal on both synthetic and real data, with special emphasis on a large fungal biodiversity study in Finland.
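To make the notion of an accumulation curve concrete, the following is a minimal Python sketch, not the authors' implementation. The function names (`accumulation_curve`, `simulate_discoveries`, `dp_discovery_prob`) are hypothetical. The discovery probability used for illustration, `alpha / (alpha + i - 1)`, is the well-known Dirichlet process form and is an assumption chosen for simplicity. The abstract does not state the specific form proposed in the paper. The rule for reusing old labels (uniform over previously seen species) is likewise only a placeholder, since the curve depends solely on when new labels appear.

```python
import random


def accumulation_curve(labels):
    """Return [K_1, ..., K_n], where K_i is the number of distinct
    labels among the first i objects of the sequence."""
    seen = set()
    curve = []
    for label in labels:
        seen.add(label)
        curve.append(len(seen))
    return curve


def dp_discovery_prob(alpha):
    """Illustrative assumption: Dirichlet-process discovery probability,
    i.e. the i-th object (1-indexed) is new with prob. alpha / (alpha + i - 1)."""
    return lambda i: alpha / (alpha + i - 1)


def simulate_discoveries(n, discovery_prob, seed=0):
    """Simulate n labelled objects. Object i carries a new label with
    probability discovery_prob(i); otherwise it reuses an old label
    (uniformly at random here, a placeholder choice)."""
    rng = random.Random(seed)
    labels = []
    n_species = 0
    for i in range(1, n + 1):
        # The first object is always a discovery.
        if n_species == 0 or rng.random() < discovery_prob(i):
            n_species += 1
            labels.append(n_species)
        else:
            labels.append(rng.randint(1, n_species))
    return labels


if __name__ == "__main__":
    labels = simulate_discoveries(1000, dp_discovery_prob(alpha=5.0), seed=1)
    curve = accumulation_curve(labels)
    print("distinct labels after 10, 100, 1000 objects:",
          curve[9], curve[99], curve[999])
```

Swapping `dp_discovery_prob` for any other function of `i` with values in [0, 1] changes the shape of the simulated curve, which is the kind of flexibility the abstract attributes to specifying the discovery probability directly.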