Improved Topic Modeling in Twitter Through Community Pooling

Paper title

Improved Topic modeling in Twitter through Community Pooling

Authors

Federico Albanese, Esteban Feuerstein

Abstract

Social networks play a fundamental role in the propagation of information and news. Characterizing the content of messages is vital for tasks such as breaking news detection, personalized message recommendation, fake user detection, and information flow characterization. However, Twitter posts are short and often less coherent than other text documents, which makes it challenging to apply text mining algorithms to these datasets efficiently. Tweet pooling (aggregating tweets into longer documents) has been shown to improve automatic topic decomposition, but the performance achieved in this task varies with the pooling method. In this paper, we propose a new pooling scheme for topic modeling in Twitter, which groups tweets whose authors belong to the same community (a group of users who mainly interact with each other but not with other groups) on a user interaction graph. We present a complete evaluation of this methodology, state-of-the-art schemes, and previous pooling models in terms of cluster quality, document retrieval performance, and supervised machine learning classification score. Results show that our community pooling method outperformed the other methods on the majority of metrics in two heterogeneous datasets, while also reducing the running time. This is useful when dealing with large amounts of noisy, short, user-generated social media text. Overall, our findings contribute to an improved methodology for identifying the latent topics in a Twitter dataset without modifying the basic machinery of a topic decomposition model.
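
The pooling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which community detection algorithm is used, so a toy deterministic label propagation stands in for it here, and the user names, interaction weights, tweets, and the function names detect_communities and pool_by_community are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical interaction graph: (user_a, user_b, number_of_interactions).
INTERACTIONS = [
    ("ana", "bob", 2), ("bob", "cat", 2), ("ana", "cat", 2),
    ("dan", "eve", 2), ("eve", "fay", 2), ("dan", "fay", 2),
    ("cat", "dan", 1),  # a single weak link between the two groups
]

# Hypothetical tweets as (author, text) pairs.
TWEETS = [
    ("ana", "new phone launch today"),
    ("bob", "phone camera review"),
    ("cat", "battery life on the new phone"),
    ("dan", "match starts at nine"),
    ("eve", "great goal in the match"),
    ("fay", "final score of the match"),
]


def detect_communities(interactions, max_iters=20):
    """Toy label propagation (a stand-in, not the paper's algorithm).

    Each user repeatedly adopts the label carrying the most interaction
    weight among their neighbours. Ties go to the smallest label so the
    result is deterministic.
    """
    neighbors = defaultdict(Counter)
    for user_a, user_b, weight in interactions:
        neighbors[user_a][user_b] += weight
        neighbors[user_b][user_a] += weight

    labels = {user: user for user in neighbors}
    for _ in range(max_iters):
        changed = False
        for user in sorted(neighbors):
            votes = Counter()
            for other, weight in neighbors[user].items():
                votes[labels[other]] += weight
            best = max(votes.values())
            new_label = min(lab for lab, count in votes.items() if count == best)
            if new_label != labels[user]:
                labels[user] = new_label
                changed = True
        if not changed:
            break
    return labels


def pool_by_community(tweets, labels):
    """Concatenate all tweets whose authors share a community label."""
    pools = defaultdict(list)
    for author, text in tweets:
        # Authors absent from the graph fall back to a pool of their own.
        pools[labels.get(author, author)].append(text)
    return {community: " ".join(texts) for community, texts in pools.items()}


labels = detect_communities(INTERACTIONS)
documents = pool_by_community(TWEETS, labels)
for community, document in sorted(documents.items()):
    print(community, "->", document)
```

Each pooled document would then be passed to an unmodified topic model in place of the individual tweets, which matches the abstract's point that the topic decomposition machinery itself stays the same. On real data, an established community detection method from a graph library would replace the toy routine.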
