Paper title

Markov Chain Monte Carlo for generating ranked textual data

Authors

Roy Cerqueti, Valerio Ficcadenti, Gurjeet Dhesi, Marcel Ausloos

Abstract

This paper addresses a central theme in applied statistics and information science: the assessment of the stochastic structure of rank-size laws in text analysis. We consider the words in a corpus by ranking them in descending order of frequency. The starting point is that the ranked data generated in linguistic contexts can be viewed as realisations of a discrete-state Markov chain whose stationary distribution behaves according to a discretisation of the best-fitted rank-size law. The methodological toolkit is Markov Chain Monte Carlo, specifically the Metropolis-Hastings algorithm. The theoretical framework is applied to the rank-size analysis of the hapax legomena occurring in the speeches of the US Presidents. We offer a large number of statistical tests supporting the consistency of our methodological proposal. In pursuing our aims, we also offer arguments supporting the view that hapaxes are rare ("extreme") events resulting from memoryless-like processes. Moreover, we show that the considered sample has the stochastic structure of a Markov chain of order one. Importantly, we discuss the versatility of the method, which is considered suitable for deducing similar outcomes in other applied science contexts.
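The central idea of the abstract, a discrete-state Markov chain whose stationary distribution follows a discretised rank-size law, can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not taken from the paper: the target is a pure power law w(r) = r^(-alpha) over ranks 1..n (the abstract does not state the functional form of the best-fitted law), the proposal is a symmetric step of +1 or -1 (so the Hastings correction equals 1 and the scheme reduces to plain Metropolis), and the names `rank_size_weights` and `metropolis_hastings` are hypothetical.

```python
import random
from collections import Counter


def rank_size_weights(n_ranks, alpha=1.0):
    """Unnormalised target weights w(r) = r**(-alpha) for ranks 1..n_ranks.

    Assumption: a pure power law stands in for the best-fitted rank-size
    law, whose actual form is not given in the abstract.
    """
    return [r ** (-alpha) for r in range(1, n_ranks + 1)]


def metropolis_hastings(weights, n_steps, seed=0):
    """Generate a chain of ranks whose stationary law is proportional to weights.

    The proposal moves one rank up or down with equal probability. It is
    symmetric, so the acceptance ratio is just the ratio of target weights.
    """
    rng = random.Random(seed)
    n = len(weights)
    state = 1  # start at the top rank
    chain = []
    for _ in range(n_steps):
        proposal = state + rng.choice((-1, 1))
        # Proposals outside 1..n have target weight zero and are rejected.
        if 1 <= proposal <= n:
            accept_prob = min(1.0, weights[proposal - 1] / weights[state - 1])
            if rng.random() < accept_prob:
                state = proposal
        chain.append(state)
    return chain


if __name__ == "__main__":
    weights = rank_size_weights(10, alpha=1.0)
    chain = metropolis_hastings(weights, n_steps=200_000)
    counts = Counter(chain)
    total_weight = sum(weights)
    for rank in range(1, 11):
        empirical = counts[rank] / len(chain)
        target = weights[rank - 1] / total_weight
        print(f"rank {rank:2d}: empirical {empirical:.3f}, target {target:.3f}")
```

Because each new state depends only on the current one, the generated sequence is by construction a Markov chain of order one. Comparing the empirical visit frequencies with the normalised weights gives a quick check that the chain has settled on the intended rank-size distribution.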
