Title
Topic Extraction of Crawled Documents Collection using Correlated Topic Model in MapReduce Framework
Authors
Abstract
The tremendous increase in the number of available research documents has driven researchers to propose topic models that extract the latent semantic themes of a document collection. However, extracting the hidden topics of a document collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from scalability problems as the size of the document collection increases. In this paper, the Correlated Topic Model (CTM) with the variational Expectation-Maximization algorithm is implemented in the MapReduce framework to address the scalability problem. The proposed approach uses a dataset crawled from a public digital library. In addition, the full texts of the crawled documents are analysed to enhance the accuracy of the MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm. The evaluation shows that the proposed approach performs comparably, in terms of topic coherence, to Latent Dirichlet Allocation (LDA) implemented in the MapReduce framework.
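As a rough illustration of how an EM-style topic model update can be split into map and reduce phases, the following is a minimal sketch in plain Python. It is not the authors' implementation and it is heavily simplified: the per-document vector `eta` stands in for the variational mean of CTM's logistic-normal topic proportions and is held fixed, the covariance update is omitted, and all names (`map_document`, `reduce_counts`, `m_step`) and the toy data are hypothetical. Each mapper computes expected topic-word counts for one document, the reducer sums them by key, and the M-step normalizes the totals.

```python
import math
from collections import defaultdict
from functools import reduce


def softmax(eta):
    """Map a real-valued vector to topic proportions (the logistic link used by CTM)."""
    m = max(eta)
    exps = [math.exp(x - m) for x in eta]
    total = sum(exps)
    return [e / total for e in exps]


def map_document(doc, eta, beta):
    """Mapper: emit expected (topic, word) counts for a single document.

    doc  : dict mapping word -> count
    eta  : per-document real vector (assumed fixed here; a full CTM would optimize it)
    beta : list of dicts, beta[k][word] = current topic-word probability
    """
    theta = softmax(eta)
    partial = defaultdict(float)
    for word, count in doc.items():
        weights = [theta[k] * beta[k][word] for k in range(len(beta))]
        norm = sum(weights)
        for k, w in enumerate(weights):
            partial[(k, word)] += count * w / norm
    return partial


def reduce_counts(acc, partial):
    """Reducer: sum partial counts that share the same (topic, word) key."""
    for key, value in partial.items():
        acc[key] += value
    return acc


def m_step(totals, num_topics, vocab):
    """Normalize the aggregated counts into new topic-word distributions."""
    beta = []
    for k in range(num_topics):
        row_sum = sum(totals[(k, w)] for w in vocab)
        beta.append({w: totals[(k, w)] / row_sum for w in vocab})
    return beta


# Toy data (hypothetical): three documents, two topics, four vocabulary words.
vocab = ["topic", "model", "hadoop", "cluster"]
docs = [
    {"topic": 3, "model": 2},
    {"hadoop": 2, "cluster": 3},
    {"topic": 1, "model": 1, "hadoop": 1},
]
etas = [[1.0, -1.0], [-1.0, 1.0], [0.0, 0.0]]
beta = [
    {"topic": 0.4, "model": 0.4, "hadoop": 0.1, "cluster": 0.1},
    {"topic": 0.1, "model": 0.1, "hadoop": 0.4, "cluster": 0.4},
]

# One iteration: map over documents, reduce by key, then re-estimate beta.
partials = [map_document(d, e, beta) for d, e in zip(docs, etas)]
totals = reduce(reduce_counts, partials, defaultdict(float))
new_beta = m_step(totals, num_topics=2, vocab=vocab)
```

Because each mapper only needs its own document plus the shared `beta`, the per-document work can run independently on separate nodes, which is the property that makes this kind of update a natural fit for MapReduce.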